Chapter I


A  General  Discussion  of  Correlation 


A.     A General Definition of Correlation


When a series of measures in statistics is studied, it is, in general,
desirable to determine three points: (1) the computation of some average,
the mean, median, or mode, to represent the series; (2) the picturing of
the degree of concentration by obtaining a measure of the dispersion;
(3) the graphic picture of the distribution by plotting the smoothed
frequency curve. When two or more series are to be compared, it is often
necessary to find some method of determining the relationship between the
series. This relationship is called correlation.

Suppose we consider a hypothetical case in discussing the correlation
between marks given to a class of twenty pupils in plane geometry and English.

School Marks Given a Class of Twenty Pupils in Plane Geometry and English

Columns: Pupils; Average Marks in Plane Geometry; Average Marks in English;
Rank in Achievement in Plane Geometry; Rank in Achievement in English.

[Table entries are not recoverable.]

The graph would show readily that a pupil who stood high in plane
geometry would also receive high marks in English and, if low in geometry, was
also low in English. This graphic method would give a fair idea of the
relationship existing between the two series but would not give an exact
numerical expression, nor would it give an expression which would summarize
the situation.

The problem, then, is to find some device which would yield a numerical
expression that would completely describe the relation existing between the
two series. This numerical expression is called the "coefficient of
correlation." The coefficient of correlation is the term used generally in
statistics to refer to the one obtained by the product-moment method and is
designated by "r". It is an index of linear correlation, which is the type
that will be discussed in this paper.

The  series  in  Fig.  I  form  an  example  of  linear  correlation  because  the 
points  tend  to  form  a  straight  band  across  the  graph.     If  this  were  perfect 
linear  correlation,  the  points  would  lie  on  a  straight  line.     Thus  we  might 
define  correlation  as  the  "tendency  for  two  observed  variables  to  be  related 
in  the  form  of  a  single-valued  mathematical  function." 

The product-moment formula will be developed later, but a brief
explanation of it may aid in stating what is meant by correlation. A measure
of relationship between the variables might be obtained by considering
products of the deviations from the means expressed in terms of the standard
deviations. Let $x = \frac{X - M_x}{\sigma_x}$ and $y = \frac{Y - M_y}{\sigma_y}$;
then we could sum the N pairs of products $xy$ and divide by N. The result
$\frac{\Sigma xy}{N}$ is represented by r and is called the
product-moment formula for correlation. It will be shown later that r may
vary from -1 (perfect negative correlation) through 0 (lack of correlation)
to +1 (perfect positive correlation).

(Holzinger, "Statistical Methods in Education")
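
The product-moment idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `product_moment_r` and the sample marks are hypothetical, and the standard deviations are taken with divisor N, as in the text.

```python
import math

def product_moment_r(xs, ys):
    """Mean product of paired deviations, each expressed in standard-deviation units."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Standard deviations with divisor N (not N - 1), following the text.
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    # Deviations from the means in terms of the standard deviations.
    zx = [(x - mean_x) / sd_x for x in xs]
    zy = [(y - mean_y) / sd_y for y in ys]
    # Sum the N products and divide by N.
    return sum(a * b for a, b in zip(zx, zy)) / n

# Hypothetical marks for four pupils (not taken from the thesis).
geometry = [60, 70, 80, 90]
english = [65, 70, 85, 88]
r = product_moment_r(geometry, english)
```

A series that rises exactly in step with another gives r = +1, and one that falls exactly in step gives r = -1.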

B.     The Correlation Table and Correlation Surface.
If we were to consider the following problem, to find the coefficient of
correlation between the two series,

(1) Heights in inches of Glasgow school boys, ages 4.5 to 5.5 years,
and    (2) Weights in pounds of these same boys,
the work would be arranged in a double entry table called a correlation
table. In this table the frequencies are thought of as being concentrated
at the midpoint of the class intervals; that is, the weights are divided
into class intervals as follows:

24-28, 29-33, 34-38, etc., with 26, 31, 36, etc., the midpoints.


[Correlation table: Height in Inches against Weight in Pounds. Table entries
are not recoverable.]

(Forsyth, "Mathematical Analysis of Statistics")

If we thought of a three dimensional coordinate system in which the
x axis was the mean of the heights, the y axis the mean of the weights,
and let the z coordinates be the frequencies, and if we passed a smooth
surface through the points thus determined, we would get what is known as
a correlation surface. If the correlation table were symmetrical, we
would find that the correlation surface was a normal surface; that is, a
bell-shaped surface with the z axis the centroid vertical.
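
As a rough illustration of how a double entry table is tallied, the following Python sketch assigns each pair of measures to the midpoints of its class intervals and counts the frequencies. The function names, the class limits for height, and the sample pairs are assumptions made for the example; measures are assumed to be recorded in whole units, so that an interval such as 24-28 has width 5 and midpoint 26.

```python
from collections import Counter

def class_midpoint(value, lower, width):
    """Midpoint of the class interval of the given width that contains value.

    Assumes whole-unit measures, so the interval starting at `lower`
    covers lower .. lower + width - 1.
    """
    index = int((value - lower) // width)
    return lower + index * width + (width - 1) / 2

def correlation_table(pairs, x_lower, x_width, y_lower, y_width):
    """Count the pairs falling in each cell, keyed by (x midpoint, y midpoint)."""
    table = Counter()
    for x, y in pairs:
        cell = (class_midpoint(x, x_lower, x_width),
                class_midpoint(y, y_lower, y_width))
        table[cell] += 1
    return table

# Hypothetical (height in inches, weight in pounds) pairs.
pairs = [(38, 30), (39, 31), (40, 36)]
table = correlation_table(pairs, x_lower=36, x_width=2, y_lower=24, y_width=5)
```

Each frequency in the resulting table is treated as concentrated at the midpoint of its cell, which is what allows the z coordinate of the correlation surface to be plotted above that point.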

C.    Methods  of  Approach  to  the  Problem  of  Correlation. 
(Rietz, "Mathematical Statistics", p. 77)

There  are  two  methods  of  approach  to  the  problem  of  correlation:  one 
is  the  "regression"  method,  the  other  the  "correlation  surface"  method. 

The  Regression  Method. 

If  we  consider  associated  values  of  x  and  y  as  plotted  in  a  scatter 
diagram  and  separate  the  dots  into  classes  by  selecting  class  intervals 
dx  and  dy,  the  y's  corresponding  to  any  class  dx  are  called  an  array  of 
y's  and  similarly  the  values  of  x  corresponding  to  any  interval  dy  are 
called an x-array. The regression curve y = f(x) is defined as the locus
of the expected value of y in the array which corresponds to an assigned
value of x as dx approaches zero; that is, the regression curve of y on x
is the locus of the means of the arrays of y's as dx approaches zero.
Similarly the regression curve of x on y is the locus of the means of the
arrays of x's as dy approaches zero. Having found the regression curves
of y on x and x on y, we are now interested in the distribution of the
values of y whose average we have predicted. This is accomplished by
measuring the dispersion of the values of y which correspond to an assigned
value of x. In other words, we wish to know the average standard deviation
of a row about the line which represents the locus of the means of the rows
and also the average standard deviation of a column about the line which
represents the locus of the means of the x-arrays.

To  illustrate  the  regression  method  we  might  consider  a  problem  of 
correlating  the  marks  of  a  class  in  geometry  and  of  the  same  class  in  English. 
We  would  first  find  a  means  of  predicting  the  mean  mark  of  a  sub-group  in  the 
geometry  class  which  had  received  identical  marks  in  English,  then  we  would 
find  a  measure  to  predict  the  dispersion  of  such  a  subgroup. 
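
The two steps of the regression method, finding the mean of each array and then its dispersion, might be sketched as follows. The function name `array_summary` and the marks are hypothetical; each distinct x value is treated as its own class, which is an assumption suited to small discrete data.

```python
import math
from collections import defaultdict

def array_summary(pairs):
    """For each x value, return (mean, standard deviation) of its array of y's."""
    arrays = defaultdict(list)
    for x, y in pairs:
        arrays[x].append(y)
    summary = {}
    for x, ys in sorted(arrays.items()):
        mean = sum(ys) / len(ys)
        sd = math.sqrt(sum((y - mean) ** 2 for y in ys) / len(ys))
        summary[x] = (mean, sd)
    return summary

# Hypothetical (English mark, geometry mark) pairs.
marks = [(70, 60), (70, 80), (80, 90)]
summary = array_summary(marks)
```

The means trace out the regression of y on x, and the standard deviations show how closely each array clusters about its mean.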
The Correlation Surface Method

In this method, we attempt to determine the probability, $\phi(x,y)\,dx\,dy$,
that a pair of associated values of x and y will fall into the rectangular
area bounded by x, (x + dx), y, and (y + dy). If m(x) is such that m(x)dx
gives, to within infinitesimals of higher order, the probability that any
x lies between x and (x + dx), and n(x,y)dy gives the probability that any y
taken from the array which corresponds to the x chosen above will lie between
y and (y + dy), then the probability that both will happen is

$$\phi(x,y)\,dx\,dy = m(x)\,n(x,y)\,dx\,dy.$$

We are thus able to set up the equation for the frequency surface
$z = \phi(x,y)$ and by a study of this to arrive at the coefficient of
correlation between x and y.
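
A discrete analogue of this product rule can be checked on a small frequency table. The sketch below is only illustrative: the table is hypothetical, the function name is an assumption, and relative frequencies stand in for the probabilities m(x)dx and n(x,y)dy.

```python
def joint_probability(table, x, y):
    """Probability of the cell (x, y), computed as m(x) * n(x, y).

    `table` maps (x, y) cells to frequencies; x is assumed to occur in it.
    """
    total = sum(table.values())
    x_total = sum(f for (xi, _), f in table.items() if xi == x)
    marginal = x_total / total                     # m(x): chance of drawing this x
    conditional = table.get((x, y), 0) / x_total   # n(x, y): chance of y within the x-array
    return marginal * conditional

# Hypothetical frequencies for two values each of x and y.
table = {(1, 1): 2, (1, 2): 2, (2, 1): 1, (2, 2): 5}
direct = table[(1, 2)] / sum(table.values())
via_product = joint_probability(table, 1, 2)
```

The product of the marginal and the array probability agrees with the relative frequency of the cell read directly from the table.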
Chapter II

A Definition of Correlation and an Introduction to the Regression Method.

Having given a general idea of the problem we might now define
correlation and then indicate how we would approach a solution by way of the
regression method.

Definition:     (Prof.  Dow  in  a  course  in  Statistics)     "A  quantity  is 
said  to  be  correlated  with  another  quantity  if  to  any  value  of  the  one 
quantity  there  exists  a  probable  value  of  the  other  quantity  and  more 
exactly  we  shall  call  x  and  y  correlated  if  when  any  particular  values  of 
y are selected, the average value of the corresponding x is thereby
determined."

If we consider a problem like that given on page 4 and set up a graph
in which we plotted only the mean value of the heights corresponding to any
given weight, we would have a graph like the one pictured on page 9.
The dots represent the mean values of the heights corresponding to the
actual values of the weight. One reference line cuts the Y axis at the mean
of the heights and the other cuts the X axis at the mean of the weights. The
line CC' is fitted by inspection as being the line of best fit which
corresponds to the actual line of the means. This line serves the purpose
of a generalized trend of the points. Since CC' is the line of best fit of
the means of the columns, it must pass through the mean point of the
entire distribution, and if B (x, y), a point on the line, is taken so that
x and y represent the deviations of this point from the means $M_x$ and $M_y$,
then the slope of this line is $\frac{y}{x}$, or is the deviation of the point
from the mean of the Y's divided by the deviation of the point from the mean
of the X's. Since the slope is always the same and since the line passes
through the mean point, we may consider this the origin and write the
equation of CC': y = mx.
Now if we find by measurement the x and y value of any one point, we may find
m and thus write the equation of the line CC'. The difficulty here is that
y is measured in inches and x is measured in pounds. This may be cared for
by dividing each by its standard deviation, which measures the variation of
each series about its mean. Therefore, if we consider the ratio
$\frac{y}{\sigma_y} \div \frac{x}{\sigma_x}$
we have a measure of the degree of relationship showing the trend of the
variations of the x's and y's. This ratio we define as r, the
coefficient of correlation.

Therefore, we write $y = r\,\frac{\sigma_y}{\sigma_x}\,x$ as the equation of the line CC'.
Similarly if we found the line best fitting the means of the rows and called
it RR', we could show the equation of RR' to be $x = r\,\frac{\sigma_x}{\sigma_y}\,y$ in the
same way.

Having explained the meaning of correlation in terms of these equations,
we shall now attempt to develop them in a more rigorous manner.
Chapter III


Development  of  the  Regression  Lines. 


A.      An Indirect Geometrical Approach

("Introduction to Theory of Statistics", Yule, p. 170 ff.)


Suppose we had a distribution in which all the means of the rows were
on the line RR' and let x be the deviations from the vertical My and y be the
deviations from the horizontal Mx. Call the slope of the line RR' to My,
$b_1$; the equation of RR' is then $x = b_1 y$. Then in any row of type y in
which the number of observations is n, $\frac{\Sigma(x)}{n}$ is the deviation
of the mean point of that row.

Since that point is on RR', we may now rewrite the equation of RR':

$$\frac{\Sigma(x)}{n} = b_1 y \quad\text{or}\quad \Sigma(x) = n\,b_1 y.$$

If we consider this for the entire distribution, we write $\Sigma(x) = b_1\,\Sigma(ny)$,
where $\Sigma(x)$ is the sum of the deviations of all the X's from My and $\Sigma(ny)$
is the sum of the deviations of all the Y's from Mx. But $\Sigma(ny) = 0$ because
Mx is the mean of all the Y's. $\therefore\ \Sigma(x) = b_1\,\Sigma(ny) = 0$.
Since $\Sigma(x) = 0$, the sum of the deviations of all the X's from My is zero,
so My must be the mean of all the X's and must cut the X axis at the mean of X.
In this way we have shown M to be the mean of the entire distribution.


Now RR' passes through M, a point which we may locate for any
distribution. Therefore, to write the equation for the line we have only to
determine $b_1$, its slope.

If we define $p = \frac{\Sigma(xy)}{N}$, the mean product of all associated deviations
of x and y, then for any row of type y we have, since y is a constant,

$$\Sigma(xy) = y\,\Sigma(x) = y\,n\,b_1 y = n\,b_1 y^2.$$

For all the rows we have $\Sigma(xy) = b_1\,\Sigma(ny^2)$,
but $\Sigma(ny^2) = N\sigma_y^2$, so $\Sigma(xy) = N b_1 \sigma_y^2$,
and as $\Sigma(xy) = Np$, $b_1 = \frac{p}{\sigma_y^2}$.

Reasoning in a similar manner, if the means of the columns all lie on
CC' and if $b_2$ represents the slope of CC' to the horizontal, $b_2 = \frac{p}{\sigma_x^2}$.

Also we may show that CC' passes through M.

Hence the equations for RR' and CC' are

$$x = \frac{p}{\sigma_y^2}\,y \quad\text{and}\quad y = \frac{p}{\sigma_x^2}\,x.$$

The forms of these equations are not suitable for calculation, so we
must rewrite them. If we set $r = \frac{p}{\sigma_x \sigma_y}$ we introduce the usual
notation for the coefficient of correlation and write:

$$x = r\,\frac{\sigma_x}{\sigma_y}\,y \quad\text{and}\quad y = r\,\frac{\sigma_y}{\sigma_x}\,x.$$

These are the same equations we arrived at in our general discussion above.


In this discussion we assumed that the means of the rows would fall on
RR' and the means of the columns would fall on CC'. We must now consider
the more usual situation where this does not occur.

If the values of x and y (the deviations from My and Mx) be found for
all associated pairs of values, then we find:

$\Sigma(x - b_1 y)^2$ is a minimum when $b_1 = r\,\frac{\sigma_x}{\sigma_y}$,
and $\Sigma(y - b_2 x)^2$ is a minimum when $b_2 = r\,\frac{\sigma_y}{\sigma_x}$,
where x is the actual deviation from the mean and $b_1 y$ is the estimated
deviation.
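A.
Proof.

Let $b_1 = r\,\frac{\sigma_x}{\sigma_y}$; then

(1)  $\Sigma(x - b_1 y)^2 = \Sigma(x^2 - 2 b_1 xy + b_1^2 y^2)$

(2)  $= \Sigma x^2 - 2 b_1 \Sigma xy + b_1^2 \Sigma y^2$

(3)  $= N\sigma_x^2 - 2 b_1 N p + b_1^2 N \sigma_y^2$

(4)  $= N\sigma_x^2 - 2 b_1 N r \sigma_x \sigma_y + b_1^2 N \sigma_y^2$

(5)  $= N\sigma_x^2 - 2 N r^2 \sigma_x^2 + N r^2 \sigma_x^2$

(6)  $= N\sigma_x^2 - N r^2 \sigma_x^2$

(7)  $\Sigma(x - b_1 y)^2 = N\sigma_x^2(1 - r^2)$

Now if $b_1$ equals any other value such as $r\,\frac{\sigma_x}{\sigma_y} + \delta$,

B.
Proof.

then

(1)  $\Sigma(x - b_1 y)^2 = \Sigma\left[x - \left(r\,\frac{\sigma_x}{\sigma_y} + \delta\right) y\right]^2$

(2)  $= \Sigma\left[x^2 - 2xy\left(r\,\frac{\sigma_x}{\sigma_y} + \delta\right) + \left(r\,\frac{\sigma_x}{\sigma_y} + \delta\right)^2 y^2\right]$

(3)  $= N\sigma_x^2 - 2 N r \sigma_x \sigma_y \left(r\,\frac{\sigma_x}{\sigma_y} + \delta\right) + \left(r\,\frac{\sigma_x}{\sigma_y} + \delta\right)^2 N \sigma_y^2$

(7)  $\Sigma(x - b_1 y)^2 = N\sigma_x^2(1 - r^2) + N\delta^2\sigma_y^2$

Now we showed when $b_1 = r\,\frac{\sigma_x}{\sigma_y}$,
$\Sigma(x - b_1 y)^2 = N\sigma_x^2(1 - r^2)$ (A),
and when $b_1 = \left(r\,\frac{\sigma_x}{\sigma_y} + \delta\right)$,
$\Sigma(x - b_1 y)^2 = N\sigma_x^2(1 - r^2) + N\delta^2\sigma_y^2$ (B).
The right-hand side of B is obviously greater than the right-hand side of A;
so $\Sigma(x - b_1 y)^2$ is a minimum when $b_1 = r\,\frac{\sigma_x}{\sigma_y}$.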
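
The minimum property just proved can be checked numerically. In the sketch below the data are hypothetical and the function names are assumptions; it uses the fact that $r\,\sigma_x/\sigma_y$ equals $\Sigma xy / \Sigma y^2$ when x and y are deviations from their means, and it confirms that moving the slope by any amount $\delta$ adds $\delta^2\,\Sigma y^2 = N\delta^2\sigma_y^2$ to the sum of squares.

```python
def deviations(values):
    """Deviations of the values from their mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def sum_squared_errors(xs, ys, b):
    """Sum of (x - b*y)^2 over paired deviations."""
    return sum((x - b * y) ** 2 for x, y in zip(xs, ys))

def best_slope(xs, ys):
    """b1 = r * sigma_x / sigma_y, written as sum(xy) / sum(y^2)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(y * y for y in ys)

# Hypothetical paired measures, reduced to deviations from their means.
xs = deviations([1, 2, 3, 4, 5])
ys = deviations([2, 1, 4, 3, 6])

b1 = best_slope(xs, ys)
minimum = sum_squared_errors(xs, ys, b1)
# Excess over the minimum when the slope is shifted by delta.
excess = {delta: sum_squared_errors(xs, ys, b1 + delta) - minimum
          for delta in (0.1, -0.1, 0.5)}
```

Since the excess is never negative, no other slope can give a smaller sum of squares.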


Let us consider the distribution in a row of type y with the origin at K
and find the root mean-square deviation of the row about point T (on RR').
See Note 1, Page 1 of Footnotes.

The root mean-square deviation of the row about T is
$\sqrt{\frac{\Sigma(x - b_1 y)^2}{n}}$, where n is the frequency of the row.
Let d be the distance of the mean of the row from T and $s_x$ the standard
deviation of the row.
So for the row

$$\Sigma(x - b_1 y)^2 = n\,s_x^2 + n\,d^2.$$

For the entire distribution then

$$\Sigma(x - b_1 y)^2 = \Sigma(n\,s_x^2) + \Sigma(n\,d^2).$$

$\Sigma(n\,s_x^2)$ depends only on the standard deviations of the rows and remains
unchanged regardless of the slope of RR', so d and $(x - b_1 y)$ are the only
terms affected by RR'.

Now if $\Sigma(x - b_1 y)^2$ is a minimum, so will $\Sigma(n\,d^2)$ be a minimum
for the value of $b_1 = r\,\frac{\sigma_x}{\sigma_y}$.
This says that the sum of the squares of the distances of the means of
the rows from RR' (each multiplied by the frequency of that row) is the
lowest possible when $b_1 = r\,\frac{\sigma_x}{\sigma_y}$.

The same can be proved in like manner for CC'; that is, the sum of the
squares of the distances of the means of the columns from CC' is the lowest
possible when $b_2 = r\,\frac{\sigma_y}{\sigma_x}$ and the equation of CC' is $y = b_2 x$.
Therefore, the equations

$$x = r\,\frac{\sigma_x}{\sigma_y}\,y \quad\text{and}\quad y = r\,\frac{\sigma_y}{\sigma_x}\,x$$

may be regarded as
"(a) equations for estimating each individual x from its associated y
(and y from its associated x) in such a way as to make the sum of the squares
of errors of estimate the least possible, or (b) equations for estimating the
mean of the x's associated with a given type of y (and the mean of the y's
associated with a given type of x) in such a way as to make the sum of the
squares of errors of estimate the least possible when every mean is counted
once for each observation on which it is based."

These  lines  are  called  the  lines  of  "best  fit"  of  the  actual  lines  of 
the  means. 


We might now clarify the reasoning in the above by attacking the problem
in a brief direct discussion.

B.    An  Analytical  Approach  to  the  Regression  Lines 

("A First Course in Statistics", Jones, p. 104 ff.)

If we considered the correlation table on page 4 and plotted the mean
values of y corresponding to each x as on page 9, we would note that as x
increased y would tend to increase. We also note that as we plot the points
$(x_1, \bar{y}_1)$, $(x_2, \bar{y}_2)$, etc., they would tend to cluster about a
straight line. If we write the equation of the line which would best fit the
points, it is y = mx + c. The problem before us is to determine the constants
m and c so that we may write this equation. If we can do this, we will be
able to find the best average value of y corresponding to any x.

Now $\bar{y}_1$, $\bar{y}_2$, $\bar{y}_3$, etc., were the best values of y corresponding to
$x_1$, $x_2$, $x_3$, etc., so if we rewrite the equation y = mx + c we will be still
estimating the best y corresponding to any given x and basing our work on all
the observations, since $\bar{y}$ is the best value of y in that particular column.

If $x = x_1$,

$$y = m x_1 + c.$$

But for any value $x_1$ of x there may be several values of y as seen in the
correlation table on page 4; if $y_1$ is one of these values, the difference
between it and the value given by the equation is $(m x_1 + c) - y_1$.

This difference measures the distance, parallel to the y axis,
between the observed point $(x_1, y_1)$ and the line y = mx + c. We now wish to
find the equation of a line such that the sum of these differences for all
paired values of x and y will be a minimum. Since some of these differences
are positive and some negative, we will search for the equation which will
make the sum of their squares a minimum. The problem, then, is to find c
and m which will make

$$(m x_1 + c - y_1)^2 + (m x_2 + c - y_2)^2 + \cdots + (m x_n + c - y_n)^2$$

a minimum.


If we consider this an expression in c, differentiate, and set equal to
zero, we will have the value of c which makes the expression a minimum.

$$(m x_1 + c - y_1) + (m x_2 + c - y_2) + \cdots + (m x_n + c - y_n) = 0$$

Dividing by n, this becomes $m\bar{x} + c - \bar{y} = 0$. The line therefore
passes through the point $(\bar{x}, \bar{y})$, the mean of the entire
distribution. This suggests that we transpose the origin to the point
$(\bar{x}, \bar{y})$ so that $x - \bar{x} = d$ and $y - \bar{y} = k$; now in the
equation $m\bar{x} + c - \bar{y} = 0$, the value of c = 0.

Now returning to $(m x_1 + c - y_1)^2 + \cdots + (m x_n + c - y_n)^2$ and
differentiating with respect to m,

$$x_1(m x_1 + c - y_1) + x_2(m x_2 + c - y_2) + \cdots + x_n(m x_n + c - y_n) = 0.$$

Now replacing the x's and y's by their deviations from the mean (i.e.
transposing the origin to $(\bar{x}, \bar{y})$),

$$m = \frac{d_1 k_1 + d_2 k_2 + \cdots + d_n k_n}{d_1^2 + d_2^2 + \cdots + d_n^2}.$$

Now if $\frac{d_1 k_1 + d_2 k_2 + \cdots + d_n k_n}{n} = p$,
then $m = \frac{np}{n\sigma_x^2} = \frac{p}{\sigma_x^2}$.

Thus we have shown that if we considered the equation y = mx + c and
transferred the origin to $(\bar{x}, \bar{y})$ this equation would be

$$k = \frac{p}{\sigma_x^2}\,d \quad\text{or}\quad (y - \bar{y}) = \frac{p}{\sigma_x^2}\,(x - \bar{x})$$

and c = 0, $m = \frac{p}{\sigma_x^2}$.


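If $x - \bar{x} = 1$, then $y - \bar{y} = \frac{p}{\sigma_x^2}$, so $\frac{p}{\sigma_x^2}$
measures the change in the deviation of y corresponding to a unit change in
the deviation of x.

If we repeated the entire discussion interchanging the x's and y's we
would arrive in exactly the same steps at the result

$$(x - \bar{x}) = \frac{p}{\sigma_y^2}\,(y - \bar{y}).$$

Thus if $(y - \bar{y}) = 1$, $(x - \bar{x}) = \frac{p}{\sigma_y^2}$, so $\frac{p}{\sigma_y^2}$
measures the deviation in x from the mean of x corresponding to a unit
deviation in y from the mean of y.

Therefore, either $\frac{p}{\sigma_x^2}$ or $\frac{p}{\sigma_y^2}$ may be considered as good
measures for the correlation between x and y; they are not alike because
$\frac{p}{\sigma_x^2}$ gives the change in y corresponding to a unit change in x and
$\frac{p}{\sigma_y^2}$ gives the change in x corresponding to a unit change in y. If we
wish to compare these changes we must reduce them to ratios which will be
comparable, so we divide $x - \bar{x}$ by the standard deviation $\sigma_x$ and
$y - \bar{y}$ by $\sigma_y$ and compare.

$$(y - \bar{y}) = \frac{p}{\sigma_x^2}\,(x - \bar{x})$$

Divide both sides by $\sigma_y$:

$$\frac{y - \bar{y}}{\sigma_y} = \frac{p}{\sigma_x \sigma_y} \cdot \frac{x - \bar{x}}{\sigma_x}$$

Similarly for

$$(x - \bar{x}) = \frac{p}{\sigma_y^2}\,(y - \bar{y})$$

Divide by $\sigma_x$:

$$\frac{x - \bar{x}}{\sigma_x} = \frac{p}{\sigma_x \sigma_y} \cdot \frac{y - \bar{y}}{\sigma_y}$$

Now we have $\frac{p}{\sigma_x \sigma_y}$ as the measure of correlation and write
$r = \frac{p}{\sigma_x \sigma_y}$.

Now substituting r in our equations, we have

$$y - \bar{y} = r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x}) \quad\text{and}\quad x - \bar{x} = r\,\frac{\sigma_x}{\sigma_y}\,(y - \bar{y})$$

which are the equations of the lines of regression of y on x and x on y
respectively.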
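
The quantities of this section can be computed directly from paired measures. The following Python sketch is an illustration only: the function name `regression_lines`, the dictionary keys, and the data are assumptions, and p, the standard deviations, and r are computed with divisor N as in the text.

```python
import math

def regression_lines(xs, ys):
    """Return the means, r, and the slopes of the two regression lines."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # p is the mean product of the paired deviations.
    p = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    r = p / (sd_x * sd_y)
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "r": r,
        "slope_y_on_x": r * sd_y / sd_x,   # equals p / sd_x**2
        "slope_x_on_y": r * sd_x / sd_y,   # equals p / sd_y**2
    }

# Hypothetical paired measures.
lines = regression_lines([1, 2, 3, 4], [1, 3, 2, 4])
# Estimated y for x = 4 from the regression of y on x.
estimate = lines["mean_y"] + lines["slope_y_on_x"] * (4 - lines["mean_x"])
```

The product of the two slopes is $r^2$, which is one way to see that the two lines coincide only when r is +1 or -1.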


C.    A  Development  of  the  Regression  Lines  Introduced  to  Show  the 
Range  of  Values  of  r.      An  Exact  Definition  of  the  Correlation 
Coefficient. 

("Mathematical Part of Elementary Statistics", Camp)
If we think of the general case of correlation, we could think of the
data represented by dots spread over the paper. We wish to find the equation
of that straight line which, on the whole, will come nearest to all these
dots. That is, if we let d be the distance between a dot and the line, we
wish to make $\Sigma d^2$ a minimum.

There are three cases, depending on whether (case a) d is measured
parallel to the y axis; (case b) d is measured parallel to the x axis;
(case c) d is measured perpendicular to the line.

(a)  The regression line of y on x.
(b)  The regression line of x on y.
(c)  Camp calls this the "geometrically best-fitting line."


A  Method  of  finding  Lines  (a)  and  (c)  by  Means  of  Least  Squares  -  Introduced 
to  Show  the  Range  of  r. 

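Case (a)        To obtain the equation of the regression of y on x.

(1)  Here we wish $\Sigma f d^2$ to be a minimum, where f represents the frequency.

(2)  Let $y = A + Bx$.

(3)  Then $d = A + Bx_1 - y_1$.

(4)  $d^2 = A^2 + B^2 x_1^2 + y_1^2 + 2ABx_1 - 2Ay_1 - 2Bx_1 y_1$

(5)  $\frac{\Sigma f d^2}{N} = A^2 + B^2\,\frac{\Sigma f x^2}{N} + \frac{\Sigma f y^2}{N} + 2AB\,\frac{\Sigma f x}{N} - 2A\,\frac{\Sigma f y}{N} - 2B\,\frac{\Sigma f xy}{N}$

(6)  With x and y measured from their means, $\Sigma f x = \Sigma f y = 0$,
     $\frac{\Sigma f x^2}{N} = \sigma_x^2$, $\frac{\Sigma f y^2}{N} = \sigma_y^2$, and $\frac{\Sigma f xy}{N} = r\sigma_x\sigma_y$.

(7)  $\frac{\Sigma f d^2}{N} = A^2 + B^2\sigma_x^2 + \sigma_y^2 - 2Br\sigma_x\sigma_y$

(8)  The expression on the left of (7) is the sum of squares and is positive,
     so the expression on the right is positive. So if we take A = 0, we
     can then find what value of B will make this expression a minimum.
     That is, we wish to make

(9)  $\sigma_y^2 + \sigma_x^2\left(B^2 - 2Br\,\frac{\sigma_y}{\sigma_x}\right)$ a minimum.

(10) This expression will be a minimum when $B^2 - 2Br\,\frac{\sigma_y}{\sigma_x}$ is a minimum.

(11) Differentiating with respect to B and equating to zero, we get
     $2B - 2r\,\frac{\sigma_y}{\sigma_x} = 0$.

(12) $B = r\,\frac{\sigma_y}{\sigma_x}$

(13) Substituting (12) in (2) and noting that A = 0 we get $y = r\,\frac{\sigma_y}{\sigma_x}\,x$.

This is the equation for the regression of y on x.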


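Case (b)

In the same manner, interchanging the y's and x's, we get the equation
of the regression of x on y:

$$x = r\,\frac{\sigma_x}{\sigma_y}\,y.$$

(14) Now in (7) we had

$$\frac{\Sigma f d^2}{N} = A^2 + B^2\sigma_x^2 + \sigma_y^2 - 2Br\sigma_x\sigma_y$$

but A = 0 and $B = r\,\frac{\sigma_y}{\sigma_x}$.

(15) Hence $\frac{\Sigma f d^2}{N} = r^2\sigma_y^2 + \sigma_y^2 - 2r^2\sigma_y^2 = \sigma_y^2(1 - r^2)$.

Now the left side of this equation is not negative, since it is the sum of
squares, so the right side is not negative. Hence

$$1 - r^2 \geq 0$$

and

$$-1 \leq r \leq 1.$$

Also in similar manner in the case of the equation for the regression of
x on y we may show

$$r^2 \leq 1 \quad\text{and}\quad -1 \leq r \leq 1.$$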
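
The identity behind this bound can be verified on sample data. The sketch below is a hedged illustration with hypothetical numbers and assumed function names: it checks that the mean squared vertical distance from the line y = A + Bx, with x and y taken as deviations from their means, equals $A^2 + B^2\sigma_x^2 + \sigma_y^2 - 2Br\sigma_x\sigma_y$, and that at A = 0, $B = r\sigma_y/\sigma_x$ it reduces to $\sigma_y^2(1 - r^2)$.

```python
import math

def moments(xs, ys):
    """Return (sd_x, sd_y, r) for paired values, using divisor N."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sd_x = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    r = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sd_x * sd_y)
    return sd_x, sd_y, r

def mean_squared_vertical_distance(xs, ys, a, b):
    """Mean of d^2 with d = a + b*x - y, x and y measured from their means."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((a + b * (x - mx) - (y - my)) ** 2 for x, y in zip(xs, ys)) / n

# Hypothetical paired measures.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 6]
sd_x, sd_y, r = moments(xs, ys)
at_minimum = mean_squared_vertical_distance(xs, ys, 0.0, r * sd_y / sd_x)
```

Because a mean of squares cannot fall below zero, $\sigma_y^2(1 - r^2)$ cannot either, and this is what confines r to the range from -1 to +1.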
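Now to return to Case (c) and to find the equation of the "geometrically
best-fitting line."

(1)  In Analytic Geometry, the formula for the distance from the point
     $(x_1, y_1)$ to the line $Ax + By + C = 0$ is

     $$d = \frac{Ax_1 + By_1 + C}{\pm\sqrt{A^2 + B^2}}$$

     where d is positive.

(2)  Our equation is $y = A + Bx$, or $Bx - y + A = 0$, and
     the point is $(x_1, y_1)$.

(3)  Hence $d = \frac{Bx_1 - y_1 + A}{\pm\sqrt{B^2 + 1}}$.

(7)  Squaring, summing over all the dots, and dividing by N as in Case (a),

     $$\frac{\Sigma f d^2}{N} = \frac{A^2 + B^2\sigma_x^2 + \sigma_y^2 - 2Br\sigma_x\sigma_y}{B^2 + 1}.$$

     Here again, the left side is positive, so the right side is positive.
     If we let A = 0, we may solve for that value of B which will make
     $\frac{\Sigma f d^2}{N}$ a minimum.

(8)  In this problem, however, we are interested only when the standard
     deviation is used as a unit, so we first set $\sigma_x = \sigma_y = 1$.

(9)  Therefore (7) becomes

     $$\frac{\Sigma f d^2}{N} = \frac{B^2 + 1 - 2Br}{B^2 + 1}$$

     and we wish to find the value of B which will make this a minimum.

(10) Rewrite (9):

     $$\frac{\Sigma f d^2}{N} = 1 - \frac{2Br}{B^2 + 1}$$

(11) Differentiating with respect to B and equating to zero,

     $$\frac{2r(B^2 - 1)}{(B^2 + 1)^2} = 0, \quad\text{so}\quad B = \pm 1.$$

(12) When r > 0, $1 - \frac{2Br}{B^2 + 1}$ will be a minimum when B = 1.
     When r < 0, $1 - \frac{2Br}{B^2 + 1}$ will be a minimum when B = -1.
     When r = 0, $1 - \frac{2Br}{B^2 + 1} = 1$ and cannot be a minimum.

(13) So we may write the equation for the "geometrically best-fitting line"
     y = x    if r > 0 and $\sigma_x = \sigma_y = 1$
     y = -x   if r < 0 and $\sigma_x = \sigma_y = 1$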

Now if we let $\sigma_x = \sigma_y = 1$ and rewrite the equations for the
regression lines, we have

(a)  The equation of the regression of y on x:  y = rx

(b)  The equation of the regression of x on y:  x = ry

(c)  The equation of the "geometrically best-fitting" line:
     y = x    if r > 0
     y = -x   if r < 0

(d)  When d is measured parallel to the y axis, $\frac{\Sigma f d^2}{N} = 1 - r^2$

(e)  When d is measured parallel to the x axis, $\frac{\Sigma f d^2}{N} = 1 - r^2$

(f)  When d is the perpendicular distance from the point to the line,
     $\frac{\Sigma f d^2}{N} = 1 - r$ when r > 0 and $\frac{\Sigma f d^2}{N} = 1 + r$ when r < 0,
     so $\frac{\Sigma f d^2}{N} = 1 - |r|$.

Therefore, from (d), (e), and (f) we may write

Th.      $|r|$ measures the closeness with which the dots cluster
about the geometrically best-fitting line; $r^2$ measures the closeness with
which the dots cluster about the regression lines (distances in the last case
being measured parallel to the y and x axis).
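
Statement (f) can be checked numerically once the measures are put in standard units. The following sketch uses hypothetical data and assumed function names; it computes the mean squared perpendicular distance from the dots to a line y = Bx through the origin and compares it with 1 - r for B = 1 and 1 + r for B = -1.

```python
import math

def standardize(values):
    """Deviations from the mean in units of the standard deviation (divisor N)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def mean_squared_perpendicular_distance(zx, zy, slope):
    """Mean squared perpendicular distance from the points to y = slope * x."""
    n = len(zx)
    return sum((slope * x - y) ** 2 for x, y in zip(zx, zy)) / (n * (slope ** 2 + 1))

# Hypothetical paired measures, reduced to standard units.
zx = standardize([1, 2, 3, 4, 5])
zy = standardize([2, 1, 4, 3, 6])
r = sum(a * b for a, b in zip(zx, zy)) / len(zx)

about_plus_line = mean_squared_perpendicular_distance(zx, zy, 1.0)    # line y = x
about_minus_line = mean_squared_perpendicular_distance(zx, zy, -1.0)  # line y = -x
```

For a positive r the line y = x gives the smaller value, 1 - r, which shrinks to zero as r approaches 1.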

Now we are in a position to prove the following theorem.

Th.      If $\sigma_x = \sigma_y = 1$, the line y = x, when r > 0, bisects the
angle between the lines y = rx and x = ry; when r < 0, the line
y = -x bisects the angle between the lines y = rx and x = ry.

We shall prove the first part; for the proof of the second part would
be the same, only we would be working in the fourth quadrant.

Proof:

(1)  Since there is no constant term, these lines all pass through the origin.
     (Note the positive direction of y is downward.)

(2)  The equation for the line (a) is y = rx, so $\tan\alpha = r$, where
     $\alpha$ is the angle between (a) and the x axis.

(3)  Similarly for (b), x = ry, so $\tan\beta = r$, where $\beta$ is the
     angle between (b) and the y axis.

(4)  $\therefore$ from (2) and (3), $\alpha = \beta$.

(5)  The line y = x bisects the angle xOy.

(6)  So the angle between (a) and (c) is $45^\circ - \alpha$.
     The angle between (b) and (c) is $45^\circ - \beta$.

(7)  $\therefore$ Since $\alpha = \beta$, $45^\circ - \alpha = 45^\circ - \beta$.

And we have proved that the line (c) bisects the angle between (a) and (b).
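
The bisection property can be illustrated with a short computation of the two angles for several positive values of r. The function name is hypothetical, and the sketch assumes standard units and 0 < r <= 1, so that line (a) is y = rx and line (b) is x = ry.

```python
import math

def angles_to_bisector(r):
    """Angles in degrees between y = x and each regression line, for 0 < r <= 1."""
    # Line (a), y = r*x, makes the angle atan(r) with the x axis.
    line_a = math.degrees(math.atan(r))
    # Line (b), x = r*y, makes the same angle atan(r) with the y axis,
    # that is, 90 degrees minus atan(r) with the x axis.
    line_b = 90.0 - math.degrees(math.atan(r))
    return 45.0 - line_a, line_b - 45.0

angle_pairs = {r: angles_to_bisector(r) for r in (0.2, 0.5, 0.9, 1.0)}
```

The two angles agree for every r, and both shrink to zero as r approaches 1, when the regression lines close upon y = x.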
We  may  now  write  the  following  theorem. 

Th.      Let  the  standard  deviation  be  chosen  as  the  unit,  then  the  coefficient 
of  correlation  measures  the  degree  to  which  it  is  true  that  a  change  in  one 
variable  determines  an  equal  change  in  the  other. 
Proof: 

(1)  We have shown that $r^2$ measures the closeness with which the dots
     cluster about the regression lines and $|r|$ measures the closeness
     with which they cluster about the geometrically best-fitting line.

(2)  Also, since the lines (a) and (b) have the same slope r to the x axis
     and y axis, respectively, as r increases from 0 to 1 the
     lines (a) and (b) start from coincidence with the x axis and y axis,
     respectively, and rotate with equal angular velocities in the direction
     of c. When r = 1, they coincide with c.
     (Note when r = -1, these lines coincide with y = -x.)

(3)  For points on c a change in one variable determines an equal change in
     the other. For the slope of the line is 1 and if $(x', y')$ and
     $(x'', y'')$ are two points on it,

     $$\frac{y'' - y'}{x'' - x'} = 1, \quad\text{so}\quad y'' - y' = x'' - x'.$$

(4)  Therefore, the larger r is, the nearer (a) and (b) come to coincidence
     with (c); the nearer the dots lie to c; and the nearer we have the
     condition that a change in one variable produces an equal
     change in the other.

We might now sum up this discussion with the following statement of the
above theorem.

Th.       The  coefficient  of  correlation  measures  the  degree  to  which  it  is 
true  that  a  relative  change  in  one  variable  determines  an  equal  relative 
change  in  the  other.     By  a  relative  change  is  meant  the  ratio  of  the  absolute 
change  to  the  standard  deviation. 

D.     The  Standard  Deviation  of  the  Arrays 

When we have found the equations of the regression lines, we are interested
in knowing the dispersion of the rows and columns about these lines.

On page 11, we found that $\sum (x - b_1 y)^2 = N \sigma_x^2 (1 - r^2)$,

so we might define the standard deviation of the rows about the regression
line of x on y as $\sigma_x \sqrt{1 - r^2}$.

Similarly we might define the standard deviation of the columns about the
regression line of y on x as $\sigma_y \sqrt{1 - r^2}$.
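
The definition above can be checked numerically. The following is a minimal sketch, assuming deviations are measured from the means and using a small set of hypothetical paired values (the data and the function name are illustrative only, not taken from the text). It compares the root-mean-square deviation of y about the regression line of y on x with $\sigma_y \sqrt{1 - r^2}$.

```python
import math

def array_standard_deviation(xs, ys):
    """Return (direct, formula): the root-mean-square deviation of y about the
    regression line of y on x, and sigma_y * sqrt(1 - r^2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]          # deviations from the mean of x
    dy = [y - mean_y for y in ys]          # deviations from the mean of y
    sigma_x = math.sqrt(sum(d * d for d in dx) / n)
    sigma_y = math.sqrt(sum(d * d for d in dy) / n)
    r = sum(a * b for a, b in zip(dx, dy)) / (n * sigma_x * sigma_y)
    slope = r * sigma_y / sigma_x          # regression of y on x
    residuals = [b - slope * a for a, b in zip(dx, dy)]
    direct = math.sqrt(sum(e * e for e in residuals) / n)
    formula = sigma_y * math.sqrt(1 - r * r)
    return direct, formula

# Hypothetical paired measurements, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
direct, formula = array_standard_deviation(xs, ys)
```

The two returned values agree up to rounding, which is the identity the definition rests on.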
Chapter IV  The Correlation Surface

(Elderton - "Frequency Curves and Correlation")
(Forsyth - "Mathematical Analysis of Statistics")

The equation representing the correlation surface, which gives the probability
of the joint occurrence of deviations of x and y from their respective means,
whether they are dependent or independent, can be written

<^ 

This equation is developed by Elderton in his "Frequency Curves and Correlation"
and is assumed in this discussion.

Our interest now is to replace the constants in the equation by expressions
which will be of service in interpreting correlation tables; i.e.
the standard deviations of x and y and the coefficient of correlation. This
can be done by finding the volume N under the surface z.

L  *r.  ll  £1  z*4*.** 


r  r 


Let  us  consider  first         ^  ^ 

(3)  A  C  ^  «-  *  y  J 

(4)  Nov;  evaluate   J  C 


rr 


Since we will show later that any cross section of the normal surface
made by a plane parallel to the YZ plane or parallel to the XZ plane is
a  normal  curve  and  is  symmetrical,     °Z  C 
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(e)  so  y  ?  ^  r  ^  ^  c 

Since  we  integrated  with  respect  to  x  first, 

/  irrf 

we  have  then  in  ^  t  X  C  ^  x    a  typical  y  section;  so  we  may  write 

where      %      is  the  frequency  of  this  particular  section. 

~f  1 


Since it may be shown that a normal curve can be written $y = y_0 e^{-x^2 / 2\sigma^2}$,
we have

(T)    ^   =     ^'-S')  : 

(8)  Now  if  we  integrate  ^  ^  *n  e*actly  the  same  manner  we  find 

(9)  Let  _   _   ^_  .a         £  * 

(10)     From  (7)    -X-    ,    ^'-^J      ;      4  (i- ZtF) 

a  ,        a\  /  

(13)     From  (8)     L-  <*;  /      ,      <X  ^(lA?) 

(12)     In (9)      £  -  _  ^


We now have the constants a, b, and h expressed in the desired terms;
only k remains.

Returning  to 

Ar     J-y)J   a0  f  and  substituting  for 

*"  its  value     i  C~  **t  ,  we  have 


"0 


2_ 


(14)  J '  2  ^  -1 


u 


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(16)    In  (11)       a.  ckao 


(17)  yTL  =   

™  ^//^          ^/  ^ 

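(18) Now return to the equation of the surface and replace a, h, k, b by their respective values; we have

$$z = \frac{N}{2\pi\sigma_x\sigma_y\sqrt{1 - r^2}} \, e^{-\frac{1}{2(1 - r^2)}\left(\frac{x^2}{\sigma_x^2} - \frac{2rxy}{\sigma_x\sigma_y} + \frac{y^2}{\sigma_y^2}\right)}$$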

This  is  the  equation  for  the  correlation  surface  and  r  is  the  coefficient 
of  correlation. 

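The equation for the normal curve with the x axis as its mean may be
shown to be

$$z = \frac{N}{\sigma_y\sqrt{2\pi}} \, e^{-\frac{y^2}{2\sigma_y^2}}$$

and that with the y axis as its mean

$$z = \frac{N}{\sigma_x\sqrt{2\pi}} \, e^{-\frac{x^2}{2\sigma_x^2}}$$

so the probability of the deviation of any y from its mean is

$$\frac{1}{\sigma_y\sqrt{2\pi}} \, e^{-\frac{y^2}{2\sigma_y^2}}$$

and the probability of the deviation of any x from its mean is

$$\frac{1}{\sigma_x\sqrt{2\pi}} \, e^{-\frac{x^2}{2\sigma_x^2}}$$

Hence, if these probabilities are independent, the probability of the
joint occurrence of these deviations is their product, and for N observations
the corresponding frequency is

(19) $$z = \frac{N}{2\pi\sigma_x\sigma_y} \, e^{-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right)}$$

which might be considered the equation of a surface.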

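Now in (18) we showed that

$$z = \frac{N}{2\pi\sigma_x\sigma_y\sqrt{1 - r^2}} \, e^{-\frac{1}{2(1 - r^2)}\left(\frac{x^2}{\sigma_x^2} - \frac{2rxy}{\sigma_x\sigma_y} + \frac{y^2}{\sigma_y^2}\right)}$$

gave the frequency of the joint occurrence of deviations of x and y, whether
dependent or independent.

If we consider the case where r = 0, then $\sqrt{1 - r^2} = 1$ and the term in xy vanishes, and we
see that the equation reduces to

$$z = \frac{N}{2\pi\sigma_x\sigma_y} \, e^{-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right)}$$

That is, when r = 0, (18) is the formula for
the probabilities when they are independent. This would suggest that we
consider r a measure for the dependence or correlation of the variables x and y.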
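
The reduction to the independent case can be illustrated numerically. This is a minimal sketch, assuming the standard form of the normal correlation surface with N = 1; the function names and the sample values are illustrative assumptions.

```python
import math

def correlation_surface(x, y, sigma_x, sigma_y, r, n_total=1.0):
    """Height of the normal correlation surface at the deviations (x, y)."""
    norm = n_total / (2 * math.pi * sigma_x * sigma_y * math.sqrt(1 - r * r))
    q = (x / sigma_x) ** 2 - 2 * r * x * y / (sigma_x * sigma_y) + (y / sigma_y) ** 2
    return norm * math.exp(-q / (2 * (1 - r * r)))

def normal_probability(x, sigma):
    """Ordinate of the normal curve of unit area for a deviation x."""
    return math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

# With r = 0 the surface is the product of the two separate normal curves.
joint_independent = correlation_surface(0.7, -1.2, 2.0, 3.0, r=0.0)
product = normal_probability(0.7, 2.0) * normal_probability(-1.2, 3.0)

# With r different from 0 the surface no longer factors in this way.
joint_correlated = correlation_surface(1.0, 1.0, 1.0, 1.0, r=0.5)
product_unit = normal_probability(1.0, 1.0) * normal_probability(1.0, 1.0)
```

The first pair of values agree; the second pair differ, which is the sense in which r measures departure from independence.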


Chapter V  The Product-Moment Formula for Correlation.


(Jones - "A First Course in Statistics", p. 278)


yV    c    f</^;  -#.  -y 

Since    ^  ~  ^  IPO^^  (JT^  ,  and  if  we 

suppose  we  have  n  pairs  of  associated  values  of  x  and  y 

(V.  )   (Xn,  Yn) 

then  any  x/  would  occur  with  its  y,   in  the  relation 

5  _  — —   c  ~  ^'-^)  ^  ^         (T7vz.        az^  J 

The  probability^  thet  each  x  would  occur  with  its  associated  y,  assuming 
the  associated  pairs  ( xh  y,  ),  (x^y^)  -  -  -  -  (  were  observed  inde- 

pendently would  be  j_ 

 '  \H     /  £  U  V  .......  Q 


Let    K  s    ^  ^  and  substitute  //  A)  -  JjufiAl 

*Y?  ill  lHt1  *  4T 


But 


■W   j, 


% 


1 


But       (/->t>   /  -  (" 

This  probability  would  be  the  greatest  when 

^        (7  I -AT  is  the  least. 

Differentiate  with  respect  to  r  and  equate  to  0.        This  gives  the  value 
of  r  which  will  make  the  expression  a  minimum  and  will  make  the  probability 
the  greatest.  ( I  ~  y^*)  (~  4)  +0     -^L)C^)  Q 


2  i-^ 

As  =  & 

The first derivative is proportional to $r^3 - kr^2 + r - k = (r - k)(r^2 + 1)$, which vanishes only when r = k.

The second is $3r^2 - 2rk + 1$ and when r = k, this is $r^2 + 1$, which is positive.
Therefore the above expression is a minimum when r = k.

Hence the probability of the occurrence of the associated pairs of
x's and y's is the greatest when

$$r = \frac{\sum xy}{n\sigma_x\sigma_y}$$

We assumed that the values of x and y were associated and sought a
value for r which would give the maximum probability that for any x we
would obtain its associated y. This leads to a formula using $\sum xy$,
which formula we proceed to develop.
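
The conclusion r = k can be checked by a direct search. The sketch below is an illustration under stated assumptions: k is taken to denote $\sum xy / (n\sigma_x\sigma_y)$, and the quantity minimised is the negative logarithm of the joint probability per pair, with terms free of r dropped, namely $\tfrac{1}{2}\log(1 - r^2) + (1 - rk)/(1 - r^2)$. The function names are hypothetical.

```python
import math

def negative_log_probability(r, k):
    """Per-pair quantity to be minimised, up to an additive constant:
    (1/2) log(1 - r^2) + (1 - r k) / (1 - r^2)."""
    return 0.5 * math.log(1 - r * r) + (1 - r * k) / (1 - r * r)

def best_r(k, steps=1999):
    """Grid search over -0.999 <= r <= 0.999 in steps of 0.001."""
    candidates = [-0.999 + 1.998 * i / (steps - 1) for i in range(steps)]
    return min(candidates, key=lambda r: negative_log_probability(r, k))

def cubic(r, k):
    """Numerator of the first derivative: r^3 - k r^2 + r - k."""
    return r ** 3 - k * r ** 2 + r - k

r_hat = best_r(0.6)
```

For any k between -1 and 1 the search returns a value within one grid step of k, and the cubic vanishes at r = k.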


The Product Moment Formula

(Forsyth  -  "Mathematical  Analysis  of  Statistics") 


If  we  define  the  product-moment  of  a  surface   £      J-  by  the 

relation 

and  if  we  take  the  correlation  surface  .  » 

then  the  product  moment  of  the  correlation  surface  about  ^  ,  the  centroid 
vertical  is:  a  \ 

(3)    Now  consider  only       /         A  c 

-  <5C 


(6* 


2-  ^-J> 


,         -i  (*++*** j  *  _  4, 

(6)    Now  consider  -  /     /       e  L  ^ 


-5  L    <*-  £ 


(8)   -  -  4  C 


•/  -A 


 "  -  *»  / 


(9)    Now  consider  only 


• 
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shown  in  section  ^         7  J^^lf^  °*  ^ 

(12)  so       v  +      M_  _  ^  tf^  c.~^ 


(13)     Substituting  (12)  in  (2)  ^  ^ 


(18)     In  Chapter  IV,  we  found     0lo  yz-A^K)  <r£  S~~  ^  oZ  6 


h  -. 


k  = 


/y  

51 


(19}     Sn  -    ^  •  — ~  "^^T       •    /T~  J 


(»)   ,    4  ,  tfTY<-^) 

(21)    r  =     l  x 
Chapter VI  Some Interesting Points Arrived at by Considering Normal Correlation.

(Jones - "A First Course in Statistics", Chapter XIX)


Frequency Surface for Two Correlated Variables

Before discussing the frequency surface for two correlated variables,
it might be helpful to summarize briefly some important features of the
normal curve of error.


The equation of the normal curve is

$$y = \frac{N}{\sigma\sqrt{2\pi}} \, e^{-\frac{x^2}{2\sigma^2}}$$

where y dx measures the frequency with which a variable deviates from the
mean by an amount lying between x and (x + dx); i.e. y dx measures the
frequency of "error" of size x to (x + dx).


The probability of an error lying between $x_1$ and $x_2$ is given by the ratio

$$\frac{\text{frequency of all errors between the given limits}}{\text{frequency of all errors}}$$

The frequency of all errors is $\int_{-\infty}^{\infty} y \, dx = N$.
Therefore, the probability of errors between x and x + dx is

$$\frac{y \, dx}{N} = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{x^2}{2\sigma^2}} \, dx$$


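Frequency Surface, showing the distribution of two completely independent
variables each subject to the normal law.


If $\bar{X}$, $\bar{Y}$ be taken as the origin and x and y the deviations from
$\bar{X}$, $\bar{Y}$ respectively,
then the probability of a deviation between x and x + dx is

$$\frac{1}{\sigma_x\sqrt{2\pi}} \, e^{-\frac{x^2}{2\sigma_x^2}} \, dx$$

and the probability of a deviation between y and y + dy is

$$\frac{1}{\sigma_y\sqrt{2\pi}} \, e^{-\frac{y^2}{2\sigma_y^2}} \, dy$$

Since the two distributions are independent, the probability of their
occurring together is

$$\frac{1}{2\pi\sigma_x\sigma_y} \, e^{-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right)} \, dx \, dy$$

If N is the total number of observations, the frequency with which
such deviations occur together is

$$\frac{N}{2\pi\sigma_x\sigma_y} \, e^{-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right)} \, dx \, dy$$

If $z \, dx \, dy$ represents this frequency, then

$$z = \frac{N}{2\pi\sigma_x\sigma_y} \, e^{-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right)}$$
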

This,  then,  is  the  equation  of  the  frequency  surface  when  the  variables 
are  independent. 


Now let us discuss this surface to see what it is like.

If $y = y_1$,

$$z = \frac{N}{2\pi\sigma_x\sigma_y} \, e^{-\frac{y_1^2}{2\sigma_y^2}} \, e^{-\frac{x^2}{2\sigma_x^2}}$$

But this is the equation of a normal curve
in which the mean and $\sigma_x$ are not affected by different values of y. Hence
all arrays of x are similar, having the same mean and the same standard
deviation.

Since the surface is symmetrical, the same may be said for all the
arrays of y.

Furthermore, if z = k, a constant,

$$-2\log\frac{2\pi\sigma_x\sigma_y k}{N} = \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}$$

Since the left-hand side is a constant, the right-hand side is also,
so

$$\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} = c, \text{ a constant.}$$

That is the equation of an ellipse; thus we may say where x and y occur
with the frequency k, the points (x, y) lie on an ellipse in the plane z = k,
or

$$\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} = c$$

defines the locus of
points where x and y occur with the same frequency. If we vary k, the
frequency, and consequently vary c, we get a series of ellipses. If we
project these orthogonally on the plane z = 0 we would get a series of
concentric similar ellipses. This enables us to draw the surface.


Frequency  Surface  for  Two  Correlated  Variables 


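If we consider the variables X and Y and take $\bar{X}$, $\bar{Y}$ as the origin
so that $X - \bar{X} = x$ and $Y - \bar{Y} = y$, then the line of regression giving the
best y corresponding to any x is $y = r\frac{\sigma_y}{\sigma_x}x$.

If we consider $\eta$ the error made in taking y from this equation instead
of the observed y, then $\eta = y - r\frac{\sigma_y}{\sigma_x}x$. Thus for every (x, y) there is
an $\eta$, and the same $\eta$ occurs as often as any pair (x, y) is repeated; thus
the frequency distribution of $(x, \eta)$ is exactly the same as that of (x, y). The
correlation between $\eta$ and x is $\frac{\sum x\eta}{N\sigma_x\sigma_\eta}$ and should equal zero, for

$$\frac{\sum x\eta}{N\sigma_x\sigma_\eta} = \frac{\sum xy - r\frac{\sigma_y}{\sigma_x}\sum x^2}{N\sigma_x\sigma_\eta} = \frac{Nr\sigma_x\sigma_y - Nr\sigma_x\sigma_y}{N\sigma_x\sigma_\eta} = 0$$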

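Therefore, the variables x and $\eta$ are uncorrelated and, the distributions being normal, independent, so the probability
of their occurring together is the product of their separate probabilities.
The probability of a deviation between x and (x + dx) occurring if we
consider x alone is

$$\frac{1}{\sigma_x\sqrt{2\pi}} \, e^{-\frac{x^2}{2\sigma_x^2}} \, dx$$

and the probability of a deviation between $\eta$ and $(\eta + d\eta)$ is

$$\frac{1}{\sigma_\eta\sqrt{2\pi}} \, e^{-\frac{\eta^2}{2\sigma_\eta^2}} \, d\eta$$

The probability of a combined occurrence of these deviations is

(2) $$\frac{1}{2\pi\sigma_x\sigma_\eta} \, e^{-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{\eta^2}{\sigma_\eta^2}\right)} \, dx \, d\eta$$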


But  //  C7T^=  *  ^  "  ^  -  , 


Similarly, if $\xi$ is the error made in estimating any x from $x = r\frac{\sigma_x}{\sigma_y}y$,

then

(4)      Af    TA         fV  (JJ    ( \-A,  ) 


7- 


.    \  I  0~~u.  /  -  >  0  X 


(7)    Al.o_Lr    +    ^1     -^3    (  M 


X     v/  „ 

J7* 


_J  

77^ 


3 

(8)     Substituting  (?),  (4),  (6)  and  (7)  in  (2)  ^  I 


This is the probability of the combined occurrence of deviations
x to (x + dx), $\eta$ to $(\eta + d\eta)$. Now if we substitute (3) and (4) in (8) we
get the frequency of the combined occurrence of deviations x to (x + dx)
and y to (y + dy).

Thus if z dx dy represents this frequency, where N is the total number of
observations,

$$z = \frac{N}{2\pi\sigma_x\sigma_y\sqrt{1 - r^2}} \, e^{-\frac{1}{2(1 - r^2)}\left(\frac{x^2}{\sigma_x^2} - \frac{2rxy}{\sigma_x\sigma_y} + \frac{y^2}{\sigma_y^2}\right)}$$

This equation represents the frequency surface for two correlated variables.

Let us now study this equation in order to learn some of its interesting
points.


Let t (a constant) $= \frac{N}{2\pi\sigma_x\sigma_y\sqrt{1 - r^2}}$

and consider the surface

$$z = t \, e^{-\frac{1}{2(1 - r^2)}\left(\frac{x^2}{\sigma_x^2} - \frac{2rxy}{\sigma_x\sigma_y} + \frac{y^2}{\sigma_y^2}\right)}$$

If we let $y = y_1$, we have

(1) $$z = t \, e^{-\frac{y_1^2}{2\sigma_y^2}} \, e^{-\frac{\left(x - r\frac{\sigma_x}{\sigma_y}y_1\right)^2}{2\sigma_x^2(1 - r^2)}}$$

This is the equation arrived at by taking the equation of the normal
curve

$$z = t \, e^{-\frac{y_1^2}{2\sigma_y^2}} \, e^{-\frac{x^2}{2\sigma_x^2(1 - r^2)}}$$

an equation in x and z in the plane $y = y_1$, and shifting through a distance $r y_1 \frac{\sigma_x}{\sigma_y}$ along an axis parallel to
OX. The equation is that of a normal curve with a standard deviation
$\sigma_x\sqrt{1 - r^2}$ and the mean at the intersection of the planes $y = y_1$ and
$x = r y_1 \frac{\sigma_x}{\sigma_y}$. So the greatest frequency in this particular distribution
is $z = t \, e^{-y_1^2 / 2\sigma_y^2}$, determined by the intersection of the two planes above.

If y = 0,

$$z = t \, e^{-\frac{x^2}{2\sigma_x^2(1 - r^2)}}$$

This is a normal curve with a standard deviation $\sigma_x\sqrt{1 - r^2}$ and the mean at
the origin where z = t. This mean may be considered the intersection of
the plane y = 0 and $x = r y \frac{\sigma_x}{\sigma_y}$.

Thus the planes giving the means of the x's corresponding to particular
values of y meet z = 0 in the regression line

$$x = r\frac{\sigma_x}{\sigma_y}y$$

Thus the x arrays all have the same standard deviation $\sigma_x\sqrt{1 - r^2}$,
and all have their means at the intersections of the planes through particular
values of y and the plane through the z axis and the regression line.

Similarly it can be shown that all the y arrays are normal distributions,
have the same standard deviation $\sigma_y\sqrt{1 - r^2}$, and have their means at the
intersection of the planes through particular values of x and the plane determined
by the z axis and the regression line $y = r\frac{\sigma_y}{\sigma_x}x$.
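
The statements about the arrays can be checked numerically. The following is a minimal sketch, assuming the standard normal correlation surface with N = 1; the section of the surface by a plane $y = y_1$ is summed on a fine grid to obtain the mean and standard deviation of the x array. The function names and the numerical values are illustrative assumptions.

```python
import math

def correlation_surface(x, y, sigma_x, sigma_y, r):
    """Height of the normal correlation surface for N = 1."""
    t = 1.0 / (2 * math.pi * sigma_x * sigma_y * math.sqrt(1 - r * r))
    q = (x / sigma_x) ** 2 - 2 * r * x * y / (sigma_x * sigma_y) + (y / sigma_y) ** 2
    return t * math.exp(-q / (2 * (1 - r * r)))

def x_array_mean_and_sd(y1, sigma_x, sigma_y, r, half_width=12.0, steps=4801):
    """Mean and standard deviation of the x array cut by the plane y = y1,
    found by a simple sum over equally spaced values of x."""
    step = 2 * half_width * sigma_x / (steps - 1)
    xs = [-half_width * sigma_x + i * step for i in range(steps)]
    heights = [correlation_surface(x, y1, sigma_x, sigma_y, r) for x in xs]
    total = sum(heights)
    mean = sum(h * x for h, x in zip(heights, xs)) / total
    variance = sum(h * (x - mean) ** 2 for h, x in zip(heights, xs)) / total
    return mean, math.sqrt(variance)

# Illustrative values: sigma_x = 2, sigma_y = 3, r = 0.6, section at y1 = 1.5.
mean, sd = x_array_mean_and_sd(y1=1.5, sigma_x=2.0, sigma_y=3.0, r=0.6)
```

Here the mean should fall on the regression line, $r\frac{\sigma_x}{\sigma_y}y_1 = 0.6$, and the standard deviation should be $\sigma_x\sqrt{1 - r^2} = 1.6$ whatever value of $y_1$ is chosen.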

(Tx 

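If we consider the equation

$$z = t \, e^{-\frac{1}{2(1 - r^2)}\left(\frac{x^2}{\sigma_x^2} - \frac{2rxy}{\sigma_x\sigma_y} + \frac{y^2}{\sigma_y^2}\right)}$$

and let z equal a constant, k, then

$$\frac{x^2}{\sigma_x^2} - \frac{2rxy}{\sigma_x\sigma_y} + \frac{y^2}{\sigma_y^2} = c, \qquad c = -2(1 - r^2)\log\frac{k}{t}$$

So all values of x and y which occur together with the same frequency define
points which lie on the above ellipse. We may study these ellipses by means
of the transformation

$$x' = \frac{x}{\sigma_x}, \qquad y' = \frac{y}{\sigma_y}$$

which is equivalent to an orthogonal projection. In this transformation,
the equation of the ellipse becomes

$$x'^2 + y'^2 - 2rx'y' = c$$

In the plane z = 0, the regression lines would become

$$y' = rx', \qquad y' = \frac{x'}{r}$$

The equation $x'^2 + y'^2 - 2rx'y' = c$ is symmetrical about the
lines $y' = x'$ and $y' = -x'$, hence the axes of the ellipse lie along these
lines. Turning the axes through an angle of 45°, we get

$$x' = \frac{x'' - y''}{\sqrt{2}}, \qquad y' = \frac{x'' + y''}{\sqrt{2}}$$

so the equation becomes

$$(1 - r)\,x''^2 + (1 + r)\,y''^2 = c, \qquad \text{or} \qquad \frac{x''^2}{c/(1 - r)} + \frac{y''^2}{c/(1 + r)} = 1$$

Hence the semi-major axis is $\sqrt{\frac{c}{1 - r}}$, and the semi-minor axis is
$\sqrt{\frac{c}{1 + r}}$. As r increases from 0 to 1, the semi-major axis increases
from $\sqrt{c}$ to $\infty$ and the semi-minor axis decreases from $\sqrt{c}$ to $\sqrt{c/2}$;
as r decreases from 0 to -1, the semi-major axis decreases from $\sqrt{c}$ to
$\sqrt{c/2}$ and the semi-minor axis increases from $\sqrt{c}$ to $\infty$.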

The ellipse $x'^2 + y'^2 - 2rx'y' = c$ may be written as a combination of the two equations
$x'^2 + y'^2 = c$ and $x'y' = 0$, so the ellipses pass, for different values of r,
through the intersections of the circle with the center at the origin and
$\sqrt{c}$ as the radius with the x' and y' axes.

With the picture on page 40A and on page 39, we are able to visualize
the correlation surface of two correlated variables for any r.
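
The behaviour of the axes can be tabulated directly. This is a minimal sketch, assuming the projected ellipse $x'^2 + y'^2 - 2rx'y' = c$ with semi-axes $\sqrt{c/(1 - r)}$ along $y' = x'$ and $\sqrt{c/(1 + r)}$ along $y' = -x'$; the function names are illustrative.

```python
import math

def semi_axes(c, r):
    """Semi-axes of x'^2 + y'^2 - 2 r x' y' = c, first along y' = x',
    then along y' = -x'."""
    return math.sqrt(c / (1 - r)), math.sqrt(c / (1 + r))

def ellipse_left_side(xp, yp, r):
    """Left-hand side of the ellipse equation at the point (x', y')."""
    return xp * xp + yp * yp - 2 * r * xp * yp

c, r = 4.0, 0.6
along_plus, along_minus = semi_axes(c, r)

# The ends of the two axes should lie on the ellipse.
end_plus = (along_plus / math.sqrt(2), along_plus / math.sqrt(2))
end_minus = (along_minus / math.sqrt(2), -along_minus / math.sqrt(2))

# The ellipse meets the x' axis at (sqrt(c), 0) for every r.
on_x_axis = ellipse_left_side(math.sqrt(c), 0.0, r)
```

For r = 0 both semi-axes equal $\sqrt{c}$, so the ellipse is the circle through which every member of the family passes on the axes.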

The equations we arrived at by projection were $x'^2 + y'^2 - 2rx'y' = c$,
the locus of the paired x', y' where z = k, and the regression lines
$y' = rx'$ and $y' = \frac{x'}{r}$, and the axes of the ellipses $y' = x'$ and $y' = -x'$.

Now $y' = x'$ and $y' = -x'$
form a harmonic pencil with $x' = 0$ and $y' = 0$, for the interior and exterior
bisectors of an angle form a harmonic pencil with the sides of the angle.

Also $y' = x'$ and $y' = -x'$ form a harmonic pencil with $y' = rx'$ and $y' = \frac{x'}{r}$.

Now harmonics are preserved in an orthogonal projection, so if we project back,

$$\frac{y}{\sigma_y} = \frac{x}{\sigma_x}, \qquad \frac{y}{\sigma_y} = -\frac{x}{\sigma_x}$$

are harmonic with

$$x = 0 \qquad \text{and} \qquad y = 0$$

and

$$\frac{y}{\sigma_y} = \frac{x}{\sigma_x}, \qquad \frac{y}{\sigma_y} = -\frac{x}{\sigma_x}$$

are harmonic with

$$y = r\frac{\sigma_y}{\sigma_x}x, \qquad x = r\frac{\sigma_x}{\sigma_y}y$$

Thus we say the two lines of regression corresponding to maximum correlation
(r = +1, r = -1) are harmonic with

(1) The axes,

(2) The lines of regression for any r.

In other words, the lines of regression corresponding to maximum correlation
bisect the interior and exterior angles formed by the lines of regression
for any r, a fact which we have proved in a previous section.
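
The harmonic property can be verified by computing a cross-ratio, which equals -1 for a harmonic pencil. The sketch below is an illustration under stated assumptions: each line through the origin is represented by a direction vector, and the cross-ratio is formed from 2 by 2 determinants so that the vertical line x = 0 causes no difficulty. The function names and the values of r, $\sigma_x$ and $\sigma_y$ are hypothetical.

```python
def det(u, v):
    """Determinant of the 2 x 2 matrix with rows u and v."""
    return u[0] * v[1] - u[1] * v[0]

def cross_ratio(d1, d2, d3, d4):
    """Cross-ratio (d1, d2; d3, d4) of four lines through the origin,
    each given by a direction vector."""
    return (det(d1, d3) * det(d2, d4)) / (det(d1, d4) * det(d2, d3))

def regression_pencil(r, sigma_x, sigma_y):
    """Direction vectors of the lines for r = +1 and r = -1, of the axes,
    and of the two regression lines for the given r."""
    plus = (sigma_x, sigma_y)            # y / sigma_y =  x / sigma_x
    minus = (sigma_x, -sigma_y)          # y / sigma_y = -x / sigma_x
    axis_x = (1.0, 0.0)                  # y = 0
    axis_y = (0.0, 1.0)                  # x = 0
    y_on_x = (sigma_x, r * sigma_y)      # y = r (sigma_y / sigma_x) x
    x_on_y = (r * sigma_x, sigma_y)      # x = r (sigma_x / sigma_y) y
    return plus, minus, axis_x, axis_y, y_on_x, x_on_y

plus, minus, axis_x, axis_y, y_on_x, x_on_y = regression_pencil(0.4, 2.0, 3.0)
with_axes = cross_ratio(plus, minus, axis_x, axis_y)
with_regression_lines = cross_ratio(plus, minus, y_on_x, x_on_y)
```

Both cross-ratios come out as -1 for any r other than +1 or -1 and for any positive standard deviations, which is the harmonic relation stated in (1) and (2).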

We  have  shown  that  in  an  ideal  distribution  the  means  of  the  rows  and 
the  means  of  the  columns  all  lie  on  the  regression  lines  and  in  our  previous 
work  we  have  generalized  this  by  assuming  that  the  distributions  we  have  had 
were  so  chosen  that  if  there  were  an  infinite  number  of  paired  values  of  x 
and  y,  they  would  form  an  ideal  distribution. 


3 


Chapter  VII 


Computation  Formulas  for  the  Coefficient  of 
Correlation.  Problems. 


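Having developed the formula

$$r = \frac{\sum xy}{N\sigma_x\sigma_y}$$

where

$$\sigma_x = \sqrt{\frac{\sum x^2}{N}}, \qquad \sigma_y = \sqrt{\frac{\sum y^2}{N}}$$

we may rewrite this in several ways which will aid in the numerical
computation.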


(1) 


X 


Formula  I 

Origin  at  the  true  mean. 


by  substituting  ,  V*  -  "fr^fi 


(3) 


I 


0*  JJ 


where        KY  -  ~  JL 

ft  =  jj  /  -  g; 


Formula  III 


Origin  at  True  Mean. 


(4)  Now  to  rewrite  this  formula  so  that  the  deviations  will  refer  to 
some  point  other  than  the  true  mean  as  origin 


44 


1 


 4 


Of,*) 


£7 


ox: 


Let    h  =      X  -  >T 

In  the  formula     r  =  ■ — r—z  x  and  y  represented  the  deviations  from 

the  mean,  and  in  the  above  h  and  k  are  the  respective  deviations,  so  we  may 


rewrite 


#7*  Tu. 


80    fe  *    i  (x^}  ~  I  Lit 


(5) 


^  - 


Formula  IV 

Origin at an arbitrary point.
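
The effect of moving the origin can be illustrated as follows. This is a minimal sketch and not necessarily the exact arrangement of Formula IV: it assumes h and k are deviations from an arbitrary origin, corrects the product-moment and the standard deviations by the means of h and k, and compares the result with the formula referred to the true mean. The function names and the choice of origin are assumptions; the five pairs are the scores of the first five pupils in the problem of this chapter.

```python
import math

def r_from_true_mean(xs, ys):
    """r = sum(xy) / (N sigma_x sigma_y), x and y being deviations from the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def r_from_arbitrary_origin(xs, ys, origin_x, origin_y):
    """Same coefficient computed from deviations h, k about an arbitrary origin."""
    n = len(xs)
    hs = [x - origin_x for x in xs]
    ks = [y - origin_y for y in ys]
    mean_h, mean_k = sum(hs) / n, sum(ks) / n
    product_moment = sum(h * k for h, k in zip(hs, ks)) / n - mean_h * mean_k
    sigma_x = math.sqrt(sum(h * h for h in hs) / n - mean_h ** 2)
    sigma_y = math.sqrt(sum(k * k for k in ks) / n - mean_k ** 2)
    return product_moment / (sigma_x * sigma_y)

henmon_nelson = [149, 144, 141, 139, 136]
terman = [134, 122, 132, 142, 125]
r_true = r_from_true_mean(henmon_nelson, terman)
r_shifted = r_from_arbitrary_origin(henmon_nelson, terman, origin_x=140, origin_y=130)
```

The two values agree whatever origin is assumed, so a convenient round number near the mean may be used to lighten the arithmetic.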


c 


• 
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Now  for  one  more  formula. 


cnr  -   (fir  *  ( 

A                  '  referred  to  the  mean  as  origin. 

Or  if  in  the  above  diagram  h  =  x  -  \ 

x»    _  "2- 

z    i-  CXi  '      ~  J    *'    77  X             <??  * 


5^ 


-     ^  X 


Similarly 


Summary  of  Formulas  for  r. 


Formula  V 

Origin at an arbitrary point.


Where x and y are deviations from the true mean.
I  r  - 


_  2-  x  ^ 
hf    


II 


r  = 


(77 


Where x and y are deviations from an assumed mean.


in 


IV 


r  = 


r  s 


r  = 


/v 


(77  ^ 


I 


Problem: Correlation of the Scores Received by 106 Reading School Pupils in
the Henmon-Nelson and the Terman I. Q. Tests.


Pupil   Henmon-   Terman        Pupil   Henmon-   Terman
        Nelson                          Nelson

   1     149       134            54     107       106
   2     144       122            55     107       114
   3     141       132            56     107       109
   4     139       142            57     106       112
   5     136       125            58     106       116
   6     135       139            59     106       104
   7     135       139            60     105       117
   8     134       121            61     105       114
   9     132       123            62     104       117
  10     131       122            63     104       130
  11     130       122            64     104       119
  12     130       124            65     103       109
  13     130       123            66     103       110
  14     129       131            67     103       109
  15     128       126            68     103       112
  16     127       105            69     102       109
  17     126       132            70     102       125
  18     125       120            71     102       112
  19     125       114            72     102       101
  20     124       124            73     101       103
  21     123       130            74     101       112
  22     121       119            75     101       112
  23     120       111            76     101       109
  24     120       118            77     100        98
  25     120       117            78      99       100
  26     120       120            79      99       110
  27     117       128            80      98        99
  28     117       102            81      98       104
  29     117       108            82      98       117
  30     116       125            83      98       106
  31     115       126            84      97       110
  32     115       114            85      96        92
  33     115       112            86      95        97
  34     114       122            87      95        97
  35     114       118            88      95       124
  36     113       120            89      94       103
  37     113       119            90      94       109
  38     112       126            91      91       101
  39     112       123            92      91       104
  40     112       110            93      91       105
  41     111       120            94      90       100
  42     111       111            95      90       104
  43     111       109            96      90       110
  44     110       115            97      88        95
  45     110       104            98      87        93
  46     110       117            99      87        99
  47     110       121           100      87        97
  48     109       127           101      84        91
  49     109       121           102      80       105
  50     109       106           103      80        94
  51     109       112           104      77        80
  52     108   (illegible)       105      77        83
  53     108       102           106      75        84

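
The computation called for by the problem can be sketched as follows. This is an illustration under stated assumptions: the scores are entered as they can be read from the table above, the Terman score of pupil 52 could not be read and is entered as None, and that pupil is left out, so the coefficient is based on 105 pairs rather than 106. The function name is hypothetical and no numerical result is asserted here.

```python
import math

henmon_nelson = [
    149, 144, 141, 139, 136, 135, 135, 134, 132, 131,
    130, 130, 130, 129, 128, 127, 126, 125, 125, 124,
    123, 121, 120, 120, 120, 120, 117, 117, 117, 116,
    115, 115, 115, 114, 114, 113, 113, 112, 112, 112,
    111, 111, 111, 110, 110, 110, 110, 109, 109, 109,
    109, 108, 108, 107, 107, 107, 106, 106, 106, 105,
    105, 104, 104, 104, 103, 103, 103, 103, 102, 102,
    102, 102, 101, 101, 101, 101, 100, 99, 99, 98,
    98, 98, 98, 97, 96, 95, 95, 95, 94, 94,
    91, 91, 91, 90, 90, 90, 88, 87, 87, 87,
    84, 80, 80, 77, 77, 75,
]
terman = [
    134, 122, 132, 142, 125, 139, 139, 121, 123, 122,
    122, 124, 123, 131, 126, 105, 132, 120, 114, 124,
    130, 119, 111, 118, 117, 120, 128, 102, 108, 125,
    126, 114, 112, 122, 118, 120, 119, 126, 123, 110,
    120, 111, 109, 115, 104, 117, 121, 127, 121, 106,
    112, None, 102, 106, 114, 109, 112, 116, 104, 117,   # pupil 52: unreadable
    114, 117, 130, 119, 109, 110, 109, 112, 109, 125,
    112, 101, 103, 112, 112, 109, 98, 100, 110, 99,
    104, 117, 106, 110, 92, 97, 97, 124, 103, 109,
    101, 104, 105, 100, 104, 110, 95, 93, 99, 97,
    91, 105, 94, 80, 83, 84,
]

def product_moment_r(pairs):
    """r = sum(xy) / (N sigma_x sigma_y) with x, y measured from the true means."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

# Keep only the pupils for whom both scores are readable.
pairs = [(x, y) for x, y in zip(henmon_nelson, terman) if y is not None]
r = product_moment_r(pairs)
```

Any of the computation formulas of this chapter applied to the same pairs should give the same value of r.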
Chapter VIII


Correlation  From  Ranks 


I. Introduction

When the data we are using express the measurements merely by
the order or rank of the individual in the series, the product-moment
formula for correlation is of no service in determining a
measure of relationship. For example, consider the following table
showing the ranks of ten students in an English and a history test.


Student                  A    B    C    D    E    F    G    H    I    J

Rank in English test     1    2    3    4    5    6    7    8    9   10

Rank in History test     2    3    4    7    6    1    5   10    8    9

Differences in Rank      1    1    1    3    1   -5   -2    2   -1   -1

We will try to show that if D is the difference in ranks of
corresponding variables in the two series of N individuals, then the
correlation between the ranks is given by

$$\rho = 1 - \frac{6\sum D^2}{N(N^2 - 1)}$$

Now it can be seen easily that correlation between actually
measured variables can be made to change without changing ranks. For
example, consider these series:


Variates 


X 

-7 

t 

X 

/ 

<£ 

3 

1 

-4 

-/ 

! 

3L 

Ranks 

/ 

<3L 

3 

y 

The  correlation  of  the  variables  and  the  correlation  of  the  ranks  are 
perfect. 


Variates 


X 

-f.f 

11 

-A 

'■01 

.of 

Ranks 


/ 

3 

/ 

3 

( 


e 
Here the correlation of the ranks is still perfect but not so the
correlation of the variates.
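
The point can be illustrated with a small computation. The series below are hypothetical and are not the ones tabulated above: the second series is the cube of the first, so the order of the individuals is the same in both, while the variates themselves are not in a straight-line relation. The function names are illustrative.

```python
import math

def product_moment_r(xs, ys):
    """Product-moment coefficient with deviations taken from the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank 1 for the smallest value; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

# Hypothetical variates: y is the cube of x, so the ranks agree exactly.
x = [1, 2, 3, 4]
y = [1, 8, 27, 64]
r_variates = product_moment_r(x, y)
r_ranks = product_moment_r(ranks(x), ranks(y))
```

The correlation of the ranks is perfect, while that of the variates falls short of 1.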

This would indicate that the value of $\rho$ is not worth much by itself
for interpretation and would show the necessity of connecting $\rho$ with r,
the coefficient of correlation of the variates. We will, therefore,
try to show that under the assumption of a normal frequency distribution,
and the assumption that grades may be replaced by ranks, the corresponding
value of the correlation coefficient of the variates that correspond
to the ranks is given by

$$r = 2\sin\left(\frac{\pi}{6}\rho\right)$$


II. The Formula $\rho = 1 - \frac{6\sum D^2}{N(N^2 - 1)}$


Reference: T. L. Kelley, "Statistical Methods", pp. 191-4.
If x and y be the deviations from the mean of two variables to be
correlated and if

$$D = x - y, \qquad \sigma_D^2 = \frac{\sum D^2}{N}$$

then

$$\sigma_D^2 = \sigma_x^2 - 2r\sigma_x\sigma_y + \sigma_y^2$$

$$r = \frac{\sigma_x^2 + \sigma_y^2 - \sigma_D^2}{2\sigma_x\sigma_y}$$

If $\sigma_x = \sigma_y = \sigma$,

$$r = 1 - \frac{\sigma_D^2}{2\sigma^2} = 1 - \frac{\sum D^2}{2N\sigma^2}$$

Now if we are considering the coefficient of correlation between two
series measured in rank only, each series contains N terms, and the standard
deviations and the means of each are equal respectively. The difference
between the actual ranks of any one character would be equal to the
difference of their deviations from the mean, so we may use the above
formula, where the coefficient of correlation for ranks is defined as

$$\rho = 1 - \frac{\sum D^2}{2N\sigma^2}$$


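Now to show

$$\rho = 1 - \frac{6\sum D^2}{N(N^2 - 1)}$$

On Page 1 in the notes we show

$$\sigma^2 = S^2 - c^2$$

where S is the standard deviation about an arbitrary origin other than
the mean and c is the distance of that origin from the mean. In this case let this origin be zero, then

$$S^2 = \frac{1^2 + 2^2 + 3^2 + \cdots + N^2}{N}, \qquad c = \frac{N + 1}{2}$$

$S^2$ is really the second moment of the ranks about zero, so we may
determine $S^2$ by first determining $\mu_2'$, the second moment about zero
where the distribution consists of a frequency evenly spread over the
class intervals, instead of being concentrated at the
midpoints as is the case where rank positions are used.

The frequency distribution is represented by the line y = 1 from
x = 1/2 to x = N + 1/2. The second moment of any one rank, k, from 0 is $k^2$,
whereas the second moment of the distribution y = 1 from k - 1/2 to
k + 1/2 is

$$\int_{k - 1/2}^{k + 1/2} x^2 \, dx = k^2 + \frac{1}{12}$$

The second moment of the frequency y = 1 corresponding to the kth rank,
1/N of the frequency, is 1/12 too large, as is true for every rank; hence
the second moment of the distribution y = 1 from x = 1/2 to x = N + 1/2 will be
larger than the desired second moment by $\frac{1}{N}\left(\frac{N}{12}\right)$ or $\frac{1}{12}$. That is,

$$S^2 = \mu_2' - \frac{1}{12}$$

Now

$$\mu_2' = \frac{1}{N}\int_{1/2}^{N + 1/2} x^2 \, dx = \frac{N^2}{3} + \frac{N}{2} + \frac{1}{4}$$

so that

$$S^2 = \frac{N^2}{3} + \frac{N}{2} + \frac{1}{6} = \frac{(N + 1)(2N + 1)}{6}, \qquad \sigma^2 = S^2 - \left(\frac{N + 1}{2}\right)^2 = \frac{N^2 - 1}{12}$$

and therefore

$$\rho = 1 - \frac{\sum D^2}{2N\sigma^2} = 1 - \frac{6\sum D^2}{N(N^2 - 1)}$$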
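
The value of the standard deviation of a set of ranks can be confirmed by exact arithmetic. The following is a minimal sketch, assuming the ranks 1, 2, ..., N occur once each (no ties); the function names are illustrative.

```python
from fractions import Fraction

def second_moment_about_zero(n):
    """(1^2 + 2^2 + ... + N^2) / N as an exact fraction."""
    return Fraction(sum(k * k for k in range(1, n + 1)), n)

def rank_variance(n):
    """Mean squared deviation of the ranks 1..N from their mean (N + 1) / 2."""
    mean = Fraction(n + 1, 2)
    return sum((k - mean) ** 2 for k in range(1, n + 1)) / n

# For ten individuals the variance of the ranks is (100 - 1) / 12 = 33/4.
variance_for_ten = rank_variance(10)
```

Because the variance is $(N^2 - 1)/12$, the quantity $2N\sigma^2$ in the rank formula becomes $N(N^2 - 1)/6$, which accounts for the factor 6 in the numerator.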


III. The Formula $r = 2\sin\left(\frac{\pi}{6}\rho\right)$

Reference: Karl Pearson, "On Further Methods of Determining
Correlation", Drapers' Company Research Memoirs,
Biometric Series IV.

Having found a formula for $\rho$ for the correlation of ranks, it now
becomes necessary to get a formula which will connect $\rho$ with r, the
coefficient of correlation for the variates. In the reference above
Pearson develops such a formula

$$r = 2\sin\left(\frac{\pi}{6}\rho\right)$$

where, however, $\rho$ is the correlation for grades and not ranks. The
rank is the actual position in order of an individual and is assumed
to be at the midpoint of the class interval, hence if the rank is k,
there are k - 1/2 individuals above that particular one in the series.
Thus the grade of this particular individual would be k - 1/2, the
actual number above it in the series. Ranks form a discontinuous
series with an interval of 1 while grades form a continuous series.
The formula above may be used with ranks on the basis of two assumptions:

(1) The series we are dealing with follow the normal law.

(2) Grades of an individual may be replaced by ranks.
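
The conversion from the rank coefficient to the coefficient for the variates is a one-line computation. This is a minimal sketch of Pearson's relation $r = 2\sin(\pi\rho/6)$, subject to the two assumptions just stated; the function name is illustrative.

```python
import math

def variate_correlation_from_rank_correlation(rho):
    """Pearson's relation r = 2 sin(pi * rho / 6)."""
    return 2 * math.sin(math.pi * rho / 6)

# The end points are preserved: rho = 0, 1, -1 give r = 0, 1, -1.
r_zero = variate_correlation_from_rank_correlation(0.0)
r_one = variate_correlation_from_rank_correlation(1.0)
r_minus_one = variate_correlation_from_rank_correlation(-1.0)
r_half = variate_correlation_from_rank_correlation(0.5)
```

Between the end points r is slightly larger in absolute value than $\rho$, so the correction is always small.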

If  we  consider  a  series  of  N  terms  and  each  of  these  has  a  value  in 


an x and one in a y frequency and we wish to find the coefficient of
correlation, we may let $m_1$, $m_2$ be the means; $\sigma_1$, $\sigma_2$ the standard
deviations; r be the correlation; and x and y the deviations from the
means respectively. Then Pearson defines

n         M  N   +  f   XC  ~  *  <U 

where $g_1$ and $g_2$ are the x and y grades of the individuals in the series
and $\frac{1}{2}N$ is the mean of each. Since $g_1$ and $g_2$ are functions of x and
y, correlation between $g_1$ and $g_2$ determines the correlation between x
and y and vice versa.

Now  if  i;  =  gj  -        ;     ia  = 


and  !  i 

>  s  — -         •    £. 

aip<?T<ri  /'/-as- 


then  the  product -moment  of  the  grades  is 

(2)  _  L*  J  **  ^  ^ 

Pearson has shown in "Philosophical Transactions" Vol. 195A Page 25
that $\frac{dz}{dr} = \sigma_1\sigma_2\frac{d^2z}{dx\,dy}$. The proof of this required the
definitions and notations for multiple correlation, so it has been
assumed in this paper.

(4)     Integrating  twice  by  narts.     (See  Notes,  Page  2) 

A,  ~  aro^  J 2  S 2  *  ^r-  —  Uv- 


t. 


J 


(5)     Substituting  for  and   ^jr^  their  values  end  letting 

x  -  x'tfT,       ti'0^    (See  Notes>  PRSe  3) 


'  {See  Notes,  Page  4) 


✓V3 


{6)     Defining       Q  _     p  /"to  correspond  to  the  product- 

^  <J%'  I   moment  formula  for  correleti 


on 


/ 


(7) 


OAs   '    77^  ir^-A, 

C 


But since r is the coefficient of correlation between x and y and
$\rho$ is the coefficient of correlation between $g_1$ and $g_2$, $\rho$ is zero when
r is zero; therefore the constant above is zero.
IV. Problem Showing Correlation From Ranks Between Ten Students in a
History and an English Examination.


Student   Rank in         Rank in         Difference   Square of
          English Test    History Test                 Difference

   A           1               2               1            1
   B           2               3               1            1
   C           3               4               1            1
   D           4               7               3            9
   E           5               6               1            1
   F           6               1              -5           25
   G           7               5              -2            4
   H           8              10               2            4
   I           9               8              -1            1
   J          10               9              -1            1

                                                   Total   48
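
The rank coefficient for this table can be computed as follows. This is a minimal sketch using the ranks tabulated above and the formula $\rho = 1 - 6\sum D^2 / (N(N^2 - 1))$; as a check, the product-moment formula is also applied to the ranks themselves. The function names are illustrative.

```python
import math

english = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
history = [2, 3, 4, 7, 6, 1, 5, 10, 8, 9]

def sum_of_squared_differences(a, b):
    """Sum of D^2, D being the difference in rank for each individual."""
    return sum((y - x) ** 2 for x, y in zip(a, b))

def rank_correlation(a, b):
    """rho = 1 - 6 sum(D^2) / (N (N^2 - 1)); assumes there are no tied ranks."""
    n = len(a)
    return 1 - 6 * sum_of_squared_differences(a, b) / (n * (n * n - 1))

def product_moment_r(xs, ys):
    """Product-moment coefficient with deviations taken from the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

sum_d2 = sum_of_squared_differences(english, history)   # 48
rho = rank_correlation(english, history)                 # 1 - 288/990 = 0.709...
rho_check = product_moment_r(english, history)
```

With $\sum D^2 = 48$ and N = 10 this gives $\rho = 1 - 288/990$, or about 0.71, and the product-moment formula applied to the ranks gives the same value.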
Chapter IX  Mean Square Contingency


References: 

(1)  "On  the  Theory  of  Contingency  and  Its  Relation  to  Association 

and  Normal  Correlation"  by  Karl  Pearson  Drapers'  Company 
Research  Memoirs,  Biometric  Series,  I. 

(2)  "Statistical  Methods  Applied  to  Education" 

Harold  0.  Rugg  P.  299  et  seq. 

(3)  "Statistical  Methods  for  Students  in  Education" 

Holzinger  P.  273  et  seq. 

(4)  "Introduction  to  The  Theory  of  Statistics" 

Yule  64-67 

(5)  "Introduction to Mathematical Statistics"

Carl  J.  West  Ch.  13 


I.  Introduction 

In the work with the coefficient of correlation, we were dealing
with measured quantities, the statistics of variates. We now turn our
attention to the relationship of traits which are not capable of
quantitative measurement, the statistics of attributes.

A simple illustration will show the type of problem we are now to
deal with. Suppose in a group of eighty-nine boys we wished to learn
whether there was any association between their school work and their
behavior and that these attributes could be tabulated as follows:


School                       Behavior
Work        Bad    Troublesome    Good    Excellent

Good          3         9          12        14

Medium        4        10          16         2

Poor         10         2           7         0

Clearly  the  product-moment  method  would  not  serve  because  we  have 
no  reasonable  measurement  for  the  various  categories  of  behavior.  We 
wish  to  find  some  method  of  measuring  the  amount  of  association  which 
does  not  require  us  to  determine  scales  for  classifying  the  attributes. 


This method has been developed by Karl Pearson in his coefficient of
mean square contingency.

II. Contingency - Definition.

If we were considering the problem of the relationship of two
attributes and classified them into a number of groups $A_1, A_2, A_3, \ldots, A_s$
and $B_1, B_2, B_3, \ldots, B_t$, we would form a table containing
s rows and t columns, or s x t compartments, with the total frequency
distributed into sub-groups corresponding to these compartments.


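              B_1      B_2      B_3     ...     B_t      Totals

A_1          n_11     n_12     n_13     ...    n_1t       n_1
A_2          n_21     n_22     n_23     ...    n_2t       n_2
...
A_s          n_s1     n_s2     n_s3     ...    n_st       n_s

Totals        m_1      m_2      m_3     ...     m_t        N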

If the total frequency were N and if the numbers falling in the groups
$A_1$, $A_2$, etc. were $n_1, n_2, \dots, n_s$, respectively (see table
above), then the probability of one falling into one of these groups is

$$\frac{n_1}{N},\ \frac{n_2}{N},\ \dots,\ \frac{n_s}{N},$$

respectively. In like manner, if the numbers falling in the groups
$B_1, B_2, \dots, B_t$ are $m_1, m_2, \dots, m_t$, respectively, the
probability of one falling in one of these groups will be

$$\frac{m_1}{N},\ \frac{m_2}{N},\ \dots,\ \frac{m_t}{N},$$

respectively. Therefore, the number in the cell $A_u B_v$ to be expected
on the theory of independent probability is

$$N \cdot \frac{n_u}{N} \cdot \frac{m_v}{N} = \frac{n_u m_v}{N},$$

for the probability that a measure will fall in a row $A_u$ is
$\frac{n_u}{N}$ and the probability that it will fall in a column $B_v$
is $\frac{m_v}{N}$. Hence the probability that any one measure will fall
in this row and column is $\frac{n_u}{N} \cdot \frac{m_v}{N}$. There are
N measures, so the number expected to fall there is
$N \cdot \frac{n_u}{N} \cdot \frac{m_v}{N}$.

If the number actually observed in this cell is $n_{uv}$, then

$$n_{uv} - \frac{n_u m_v}{N}$$

measures the deviation from independent probability of the measure
falling in the compartment $A_u B_v$.
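
As an illustration, the expected frequencies and the deviations from
independent probability may be worked out for the school work and
behavior table of the Introduction. The following is a minimal sketch
in Python, not part of the original treatment. It assumes the cell
counts as they appear in that table (the lower right entry read as 1)
and takes N to be their sum; the names observed, expected, and deviation
are chosen here for illustration only.

```python
# Cell counts as they appear in the school work / behavior table.
# Rows: school work (Good, Medium, Poor).
# Columns: behavior (Bad, Troublesome, Good, Excellent).
# Assumption: the lower right entry is read as 1.
observed = [
    [3, 9, 12, 14],
    [4, 10, 16, 2],
    [10, 2, 7, 1],
]

row_totals = [sum(row) for row in observed]        # n_u
col_totals = [sum(col) for col in zip(*observed)]  # m_v
total = sum(row_totals)                            # N, taken from the counts

# Expected frequency of each cell under independence: n_u * m_v / N.
expected = [[n_u * m_v / total for m_v in col_totals] for n_u in row_totals]

# Deviation of each observed count from independent probability.
deviation = [
    [obs - exp for obs, exp in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

for obs_row, exp_row, dev_row in zip(observed, expected, deviation):
    print(obs_row,
          [round(e, 2) for e in exp_row],
          [round(d, 2) for d in dev_row])
```

The expected frequencies add up to N, and the deviations add up to zero
over the whole table, which is why a plain sum of the deviations cannot
serve as a measure of contingency.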

Pearson points out that the total deviation of the whole system
from independent probability must be some function of
$n_{uv} - \frac{n_u m_v}{N}$ for the whole table, and he terms this total
deviation from independent probability a measure of contingency.
Therefore the greater the contingency, the greater must be the amount of
correlation between the two attributes, for such a correlation is the
measure of the degree of deviation from independence of occurrence.
Pearson then points out that if we define

$$\chi^2 = \sum_{u,v} \frac{\left(n_{uv} - \frac{n_u m_v}{N}\right)^2}{\frac{n_u m_v}{N}},$$

we will have a function of $n_{uv} - \frac{n_u m_v}{N}$ which will
measure the degree of deviation of the series from independent
probability and which will bring contingency into line with the
customary notations of correlation. The formula used above is of the
type developed by Elderton in "Frequency Curves and Correlation" on
page 141 to measure the amount of agreement between two sets of figures.
Here it is used to measure the amount of agreement between our observed
data and the data of a table based on chance alone.

Definition: Mean Square Contingency.

Having defined $\chi^2$, Pearson then defines $\phi^2$, the mean square
contingency:

$$\phi^2 = \frac{\chi^2}{N}$$
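
The two definitions can be illustrated numerically. The sketch below,
in Python, is offered only as an illustration: it assumes the cell
counts of the school work and behavior table of the Introduction (the
lower right entry read as 1), and the function name chi_square is a
label chosen here, not one used by Pearson or Elderton. It also assumes
that no row or column total is zero, since the expected frequency
appears as a divisor.

```python
def chi_square(observed):
    """Return (chi-square, N) for a contingency table against independence."""
    row_totals = [sum(row) for row in observed]        # n_u
    col_totals = [sum(col) for col in zip(*observed)]  # m_v
    total = sum(row_totals)                            # N
    chi_sq = 0.0
    for n_u, row in zip(row_totals, observed):
        for m_v, n_uv in zip(col_totals, row):
            expected = n_u * m_v / total               # n_u * m_v / N
            chi_sq += (n_uv - expected) ** 2 / expected
    return chi_sq, total


# Assumption: counts as they appear in the introductory table.
observed = [
    [3, 9, 12, 14],
    [4, 10, 16, 2],
    [10, 2, 7, 1],
]

chi_sq, total = chi_square(observed)
phi_sq = chi_sq / total   # mean square contingency
print(round(chi_sq, 3), round(phi_sq, 4))
```

Dividing by N removes the dependence of $\chi^2$ on the mere size of the
group, so that $\phi^2$ may be compared between tables of different
total frequency.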


III.     Development of the Formula for Mean Square Contingency.


$$C = \sqrt{\frac{\phi^2}{1 + \phi^2}}$$


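Let x and y denote the deviations from their respective means of
two attributes; $\sigma_1$, $\sigma_2$ are the standard deviations and r
is the correlation. Then, if the correlation table can be approximately
represented by the normal correlation surface,

$$z_0 = \frac{N}{2\pi\sigma_1\sigma_2}\, e^{-\frac{1}{2}\left(\frac{x^2}{\sigma_1^2} + \frac{y^2}{\sigma_2^2}\right)}$$

represents the frequency with no correlation, as previously discussed, and

$$z = \frac{N}{2\pi\sigma_1\sigma_2\sqrt{1 - r^2}}\, e^{-\frac{1}{2(1 - r^2)}\left(\frac{x^2}{\sigma_1^2} - \frac{2rxy}{\sigma_1\sigma_2} + \frac{y^2}{\sigma_2^2}\right)}$$

represents the frequency with which we are dealing; i.e., the frequency
of the observed data.

If $n_{uv} = z\,dx\,dy$ and $\frac{n_u m_v}{N} = z_0\,dx\,dy$, then

$$\chi^2 = \sum \frac{\left(n_{uv} - \frac{n_u m_v}{N}\right)^2}{\frac{n_u m_v}{N}} = \sum \frac{(z - z_0)^2}{z_0}\,dx\,dy.$$

Therefore, if we sum over the entire table this reduces to

(1) $$\phi^2 = \frac{1}{N}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{(z - z_0)^2}{z_0}\,dx\,dy = \frac{1}{N}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{z^2}{z_0}\,dx\,dy - 1,$$

since the integrals of z and of $z_0$ over the whole plane are each
equal to N.

(2) Substituting for z and $z_0$, and letting $\lambda = \frac{1}{1 - r^2}$,

$$\phi^2 = \frac{\lambda}{2\pi\sigma_1\sigma_2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-\left[\left(\lambda - \frac{1}{2}\right)\frac{x^2}{\sigma_1^2} - \frac{2r\lambda xy}{\sigma_1\sigma_2} + \left(\lambda - \frac{1}{2}\right)\frac{y^2}{\sigma_2^2}\right]}\,dx\,dy - 1$$

(3) $$\phi^2 = \frac{\lambda}{2\pi\sigma_1\sigma_2} \cdot 2\pi\sigma_1\sigma_2 - 1$$

This follows from the fact that if $ac > b^2$,

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(ax^2 - 2bxy + cy^2)}\,dx\,dy = \frac{\pi}{\sqrt{ac - b^2}}.$$

See Notes, Page 4. Here $a = \left(\lambda - \frac{1}{2}\right)/\sigma_1^2$,
$b = r\lambda/(\sigma_1\sigma_2)$, and
$c = \left(\lambda - \frac{1}{2}\right)/\sigma_2^2$, so that
$ac - b^2 = \frac{1}{4\sigma_1^2\sigma_2^2}$, because
$\lambda(1 - r^2) = 1$.

(4) Simplifying,

$$\phi^2 = \frac{1}{1 - r^2} - 1 = \frac{r^2}{1 - r^2}$$

(5) $$r = \pm\sqrt{\frac{\phi^2}{1 + \phi^2}}$$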
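
The result of this development, $\phi^2 = \frac{r^2}{1 - r^2}$ for the
normal correlation surface, can be checked numerically. The following
Python sketch is an illustration only, under stated assumptions: unit
standard deviations and N = 1 (both cancel in $\phi^2$), and a plain
rectangle-rule sum over a finite square in place of the double
integral. The function name and the grid parameters are chosen here
for the purpose and do not come from Elderton.

```python
import math


def phi_squared_normal(r, half_width=10.0, steps=400):
    """Approximate the double integral of (z - z0)^2 / z0 with N = 1.

    z is the normal correlation surface with correlation r and z0 the
    surface with no correlation, both with unit standard deviations.
    The finite grid is adequate for moderate r only; for r near 1 the
    integrand decays slowly along the diagonal and a wider grid is needed.
    """
    h = 2.0 * half_width / steps
    q = 1.0 - r * r
    total = 0.0
    for i in range(steps + 1):
        x = -half_width + i * h
        for j in range(steps + 1):
            y = -half_width + j * h
            z0 = math.exp(-0.5 * (x * x + y * y)) / (2.0 * math.pi)
            z = math.exp(-(x * x - 2.0 * r * x * y + y * y) / (2.0 * q)) / (
                2.0 * math.pi * math.sqrt(q)
            )
            total += (z - z0) ** 2 / z0 * h * h
    return total


r = 0.6
print(phi_squared_normal(r), r * r / (1.0 - r * r))
```

For r = 0.6 the two printed figures should agree closely, the second
being $0.36 / 0.64 = 0.5625$.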

Some important conclusions can be drawn from this. Elderton, P. 148.
"(1) It shows clearly that r must lie between -1 and 1.

(2) Since the value of $\phi$ will not be affected by the order of
rows (or columns), it will be seen that it is permissible to
interchange them, provided, of course, the whole column (or row)
be moved at once.

(3) The proof shows that r will not necessarily be obtained
exactly if a very small number of groups is used, because by
using the integral calculus an infinite number of groups was
assumed.

(4) We also assumed, however, that we were dealing with a perfectly
smooth series: but since $\chi^2$ is a measure of goodness of fit
between the correlation and non-correlation figures, a very large
number of groups gives undue prominence to chance deviation, due
to the use of random sampling, and the value of r found from that
of $\phi$ may differ considerably from the value reached by the xy
moment. Too fine a grouping may give a less accurate result than
a less fine one."


Pearson further points out that since $\phi^2$ is a measure of deviation
of the series from independent probability and therefore of the amount
of association or correlation between the attributes involved, any function
of this expression is also a proper measure. Therefore, in order to bring
the coefficient of contingency into line with the notations used in the
coefficient of correlation, he defines the coefficient of mean square
contingency

$$C = \sqrt{\frac{\phi^2}{1 + \phi^2}} = \sqrt{\frac{\chi^2}{N + \chi^2}}$$
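
The choice of this particular function of $\phi^2$ may be illustrated
by a short computation. The Python sketch below is an illustration
only: it takes Pearson's coefficient in the form
$C = \sqrt{\phi^2 / (1 + \phi^2)}$ and feeds it the value
$\phi^2 = r^2 / (1 - r^2)$ that belongs to a normal correlation surface.
The function name is chosen here for illustration.

```python
import math


def contingency_coefficient(phi_sq):
    """Coefficient of mean square contingency from phi squared."""
    return math.sqrt(phi_sq / (1.0 + phi_sq))


# For a normal correlation surface phi^2 = r^2 / (1 - r^2),
# so C should reproduce the absolute value of r.
for r in (0.0, 0.3, -0.6, 0.9):
    phi_sq = r * r / (1.0 - r * r)
    print(r, round(contingency_coefficient(phi_sq), 6))
```

C therefore agrees in size with r when the data follow the normal
surface, although, being a square root, it carries no sign.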


Necessity of Limiting the Use of the Coefficient of Contingency to
5 x 5 fold or Finer Classifications.        Yule.     P. 65-6.


Yule shows that coefficients when "calculated on different systems
of classification are not comparable with each other. It is clearly
desirable, for practical purposes, that two coefficients calculated from
the same data, classified in two different ways, should be, at least
approximately, identical. With the present coefficient this is not the
case: if certain data be classified in, say, (1) 6 x 6 fold, (2) 3 x 3
fold form, the coefficient in the latter form tends to be the least.
The greatest possible value is, in fact, only unity if the number of
classes be infinitely great; for any finite number of classes the
limiting value of C is the smaller, the smaller the number of classes."
Yule, "Introduction to Theory of Statistics," P. 65.

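The proof of this statement follows:

(1) $$\chi^2 = \sum \frac{\left(n_{uv} - \frac{n_u m_v}{N}\right)^2}{\frac{n_u m_v}{N}}$$

(2) Expanding the square, and noting that the observed frequencies and
the expected frequencies each sum to N,

$$\chi^2 = \sum \frac{n_{uv}^2}{\frac{n_u m_v}{N}} - N$$

(3) Let

$$S = \sum \frac{n_{uv}^2}{\frac{n_u m_v}{N}}$$

Then

$$\chi^2 = S - N$$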


Now suppose we are to deal with a t x t fold classification in
which $n_r = m_r$ for all values of r; and suppose, further, that the
association between the two attributes is perfect, so that
$n_{rr} = n_r = m_r$ for all values of r, and the frequencies in the
remaining cells are zero. The frequency is then concentrated in the
diagonal compartments of the table. If we interpret our notation in the
light of this hypothesis, we have:

$$S = \sum_{r=1}^{t} \frac{n_{rr}^2}{\frac{n_r m_r}{N}} = tN, \qquad \chi^2 = S - N = (t - 1)N$$

So we may write

$$C = \sqrt{\frac{\chi^2}{N + \chi^2}} = \sqrt{\frac{S - N}{S}} = \sqrt{\frac{t - 1}{t}}$$

This is the greatest value of C for a symmetrical t x t - fold
classification.


Yule then shows for

t =  2        C cannot exceed        0.707
t =  3        "    "      "          0.816
t =  4        "    "      "          0.866
t =  5        "    "      "          0.894
t =  6        "    "      "          0.913
t =  7        "    "      "          0.926
t =  8        "    "      "          0.935
t =  9        "    "      "          0.943
t = 10        "    "      "          0.949


so that it is well to restrict the coefficient of contingency to 5 x 5
or finer classifications where the maximum value of C will at least
approximate unity.
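
The limiting values quoted from Yule can be reproduced from the formula
$C = \sqrt{(t - 1)/t}$ for a symmetrical t x t fold classification
with perfect association. The short Python sketch below is given only
as an illustration; the function name is chosen here.

```python
import math


def max_contingency(t):
    """Greatest value of C for a symmetrical t x t fold classification."""
    return math.sqrt((t - 1) / t)


for t in range(2, 11):
    print(t, round(max_contingency(t), 3))
```

The values rise slowly toward unity, which is the reason for preferring
the finer classifications.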
Problem. 

The coefficient of mean square contingency may be used for data
quantitatively measured as well as for that which is qualitative. It
may be used where one series is quantitative and one qualitative. The
following example is from Rugg, P. 305, and shows the steps in using
such a coefficient.

Relation Between Mental Age and Pedagogical Age.

[Table: pedagogical age (rows: Retarded 2 years, Retarded 1 year,
Normal, Accelerated 1 year, Accelerated 2 years, Totals) against
mental age (columns: 9, 10, 11, 12, 13, 14, 15, Totals); N = 82.
The handwritten cell entries are not legible in this copy.]


Table giving $\frac{n_u m_v}{N}$

[Table: pedagogical age (rows: Retarded 2 years, Retarded 1 year,
Normal, Accelerated 1 year, Accelerated 2 years) against mental age
(columns: 9, 10, 11, 12, 13, 14, 15). The handwritten cell entries
are not legible in this copy.]

The 2.85 in circle above is arrived at by the following


Table Showing $\frac{n_{uv}^2}{n_u m_v / N}$

[Table: pedagogical age (rows: Retarded 2 years, Retarded 1 year,
Normal, Accelerated 1 year, Accelerated 2 years) against mental age
(columns: 9, 10, 11, 12, 13, 14, 15). The handwritten cell entries
are not legible in this copy.]

S  =  174.656 
N  =  82 
S-N  =  92.656 
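
From these totals the coefficient follows by the relations
$\chi^2 = S - N$ and $C = \sqrt{(S - N)/S}$. The arithmetic may be
sketched in Python as below; only the figures S = 174.656 and N = 82
are taken from the example, and the variable names are chosen here
for illustration.

```python
import math

S = 174.656   # sum over the table of n_uv^2 / (n_u * m_v / N)
N = 82        # total frequency

chi_sq = S - N                # chi-square
phi_sq = chi_sq / N           # mean square contingency
C = math.sqrt(chi_sq / S)     # same as sqrt(phi_sq / (1 + phi_sq))

print(round(chi_sq, 3), round(phi_sq, 3), round(C, 3))
```

The coefficient comes to about 0.73.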


Notes


Page  1 


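I.     Root Mean-Square Deviation


Definition of root mean-square deviation:

$$s = \sqrt{\frac{\sum \xi^2}{N}},$$

or $s^2$ is the average of the squares of the deviations $\xi$ about an
arbitrary origin A.

(1) $$s^2 = \frac{\sum \xi^2}{N}$$

Let x be the deviation of a measure from the mean M, and let
$d = M - A$. Then $\xi = x + d$.

(2) $$s^2 = \frac{\sum (x + d)^2}{N} = \frac{\sum x^2}{N} + \frac{2d \sum x}{N} + d^2$$

$\sum x$ is the sum of the deviations about the mean and equals zero.

(3) $$s^2 = \frac{\sum x^2}{N} + d^2 = \sigma^2 + d^2$$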

Notes 


Page  2 


To  show  x  ^  j , 


(l)  Consider 


Integrating  by  parts 

*****     i^L~J-~  *  ^  ^ 
_  r-  ^a  dj  ^  dr 

^  j  V*  -*u,  -1*  L    *j  s*  ■ 

(3)    Now  consider      f  "°     /      ^  *~    rtu L  cLu 
Integrating  by  parts 

.(4)   *  /  ^  a  ^  r*r*  z-*^.  cU, 


Notes


Page  3 


III    To  show     ^  ^'  <f 


/v 


(i)    d  Ppt-  -  07^  /  *  —    ^  ^ 


A/ 


O-^Jx 


3r  = 


A/ 


A/  e 
/ 


-  ^ 


JL 


(2)    <*-  f§*  $ 


Y77  iff  > 


-4/ 


fie  "77=0 
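Notes  Page  4


IV.     To show

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(ax^2 - 2bxy + cy^2)}\,dx\,dy = \frac{\pi}{\sqrt{ac - b^2}}$$

if $ac > b^2$.


(1) The index may be written

$$-\left[a\left(x - \frac{b}{a}y\right)^2 + \frac{ac - b^2}{a}\,y^2\right]$$

(2) Integrating with respect to x, since

$$\int_{-\infty}^{\infty} e^{-a\left(x - \frac{b}{a}y\right)^2}\,dx = \sqrt{\frac{\pi}{a}},$$

the double integral becomes

$$\sqrt{\frac{\pi}{a}}\int_{-\infty}^{\infty} e^{-\frac{ac - b^2}{a}\,y^2}\,dy$$

(3) Now integrating with respect to y, since

$$\int_{-\infty}^{\infty} e^{-\frac{ac - b^2}{a}\,y^2}\,dy = \sqrt{\frac{\pi a}{ac - b^2}},$$

we have

$$\sqrt{\frac{\pi}{a}} \cdot \sqrt{\frac{\pi a}{ac - b^2}} = \frac{\pi}{\sqrt{ac - b^2}}$$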




Conclusion 


In this paper I have tried to show something of the development of
the formulas for simple correlation of three important types: the
coefficient of correlation for linear regression, the most important in its
frequent use; the coefficient of correlation from ranks; and the coefficient
of mean square contingency. The first of these is to be used where the data
are represented by numerical measures, the method taking full account
of the value and position of every measure in the series; the second is to
be used where only the positions of the measures are given; and the third
when the data are not in terms of numerical measures but in the form of
attributes. I have been interested in these because I felt they were
suitable formulas for work usually done in statistics in Education. A more
complete discussion should, of course, contain something of the correlation
ratio to be used where the regression is non-linear, a study of tests for
linearity, and a study of probable errors. These topics would form a
suitable study in themselves.

In  developing  the  coefficient  of  correlation  by  the  correlation  surface 
method  certain  assumptions  based  on  the  theory  of  probability  were  made,  and 
the  equation  of  the  normal  curve  was  used  without  deriving  it.     These  could 
have  been  given  a  sound  mathematical  basis  but  it  seemed  wise  to  limit  the 
paper  and  give  references  for  their  derivation. 

In  the  chapter  on  correlation  from  ranks,  the  assumption  was  made  that 
grades  could  be  replaced  by  ranks.     Pearson  makes  this  assumption  in  the 
reference  cited  in  that  chapter.     The  sound  method  would  be,  it  seems,  to 
use  grades  exclusively  but  the  work  involved  would  be  extremely  laborious 
and  the  results  not  sufficiently  different  to  warrant  such  an  effort. 